Background
I want to get HTML content from a website, parse it as HTML, and extract some content from the parsed DOM with PowerShell.
Invoke-WebRequest can get HTML from a URI, and the Microsoft.PowerShell.Commands.HtmlWebResponseObject#ParsedHtml property exposes that HTML parsed into a DOM. But if the response headers don't specify a charset and the HTML contains non-ASCII characters, ParsedHtml garbles the non-ASCII characters.
Problem
To get the HTML content with the proper encoding, you can convert HtmlWebResponseObject#Content into an HTML string like this.
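# Fetch the page; the response has no charset header, so .Content is assumed to have been decoded as ISO-8859-1.
$RawContent = Invoke-WebRequest -Method Get -Uri https://kikakurui.com/x0/X0001-1994-01.html
# Re-encode .Content as ISO-8859-1 to recover the original bytes, then decode those bytes as UTF-8.
$HtmlString = [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::GetEncoding("ISO-8859-1").GetBytes($RawContent.Content))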
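As far as I understand, the conversion above works because ISO-8859-1 maps every byte value to exactly one character, so re-encoding the wrongly decoded string gives back the original bytes. Here is a minimal offline sketch of that round trip. It assumes the page is really UTF-8 and was decoded as ISO-8859-1; the sample text and the variable names are made up for illustration and are not taken from the actual page.

```powershell
# Build a sample Japanese string from code points so the script file's own encoding does not matter.
$Original = -join [char[]](0x65E5, 0x672C, 0x8A9E)

$Latin1 = [System.Text.Encoding]::GetEncoding("ISO-8859-1")
$Utf8   = [System.Text.Encoding]::UTF8

# Simulate the problem: UTF-8 bytes wrongly decoded as ISO-8859-1.
$Garbled = $Latin1.GetString($Utf8.GetBytes($Original))

# Reverse it: recover the raw bytes with ISO-8859-1, then decode them as UTF-8.
$Restored = $Utf8.GetString($Latin1.GetBytes($Garbled))

# $true when the round trip restored the original text.
$Restored -eq $Original
```

If the page were actually in another encoding such as Shift_JIS, the UTF-8 decoding step would have to use that encoding instead.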
But when you try to get a DOM from the HTML string, [xml]$HtmlString fails if the HTML content is not well-formed XML.
PS C:\tmp> [xml]$HtmlString
Cannot convert value "<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja">
(snip)
</body>
" to type "System.Xml.XmlDocument". Error: "'src' is an unexpected token. The expected token is '='. Line 38, position
15."
At line:1 char:1
+ [xml]$HtmlString
+ ~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [], RuntimeException
+ FullyQualifiedErrorId : InvalidCastToXmlDocument
On the other hand, HtmlWebResponseObject#ParsedHtml can parse HTML even if the content is not well-formed XML, but there is no way to pass a string object into it.
Question
Is there any way to parse an HTML string that is held in a variable and is not well-formed XML into a DOM with PowerShell? The out-of-the-box features of PowerShell are preferable.
Edit
The out-of-the-box features of PowerShell are preferable because we have to do this work in a restricted VDI environment, where we need to ask permission to install additional software.