Parse non-valid HTML String in a variable into DOM with PowerShell

Question

Background

I want to fetch the HTML of a web page, parse it into an HTML DOM, and extract some content from that DOM with PowerShell.

Invoke-WebRequest can fetch HTML from a URI, and the ParsedHtml property of Microsoft.PowerShell.Commands.HtmlWebResponseObject exposes it as a parsed DOM. But if the response doesn't declare a charset in its Content-Type header and the HTML contains non-ASCII characters, ParsedHtml garbles the non-ASCII characters.

Problem

To get the HTML content with the proper encoding, you can re-decode HtmlWebResponseObject#Content into an HTML string like this. The code assumes the page is UTF-8 and that Invoke-WebRequest decoded it as ISO-8859-1.

$RawContent = Invoke-WebRequest -Method Get -Uri https://kikakurui.com/x0/X0001-1994-01.html
$HtmlString = [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::GetEncoding("ISO-8859-1").GetBytes($RawContent.Content))
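
The round trip can be checked without a network call. The following is a minimal sketch that simulates the mis-decoding on an in-memory string; the sample text and the variable names are illustrative assumptions, not taken from the response itself.

```powershell
# Sample text (illustrative only)
$Original = '日本工業規格'

# Simulate a UTF-8 response body being decoded as ISO-8859-1
$Utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($Original)
$Latin1    = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
$Garbled   = $Latin1.GetString($Utf8Bytes)

# Reverse it: ISO-8859-1 maps every byte to exactly one character,
# so converting back to bytes loses nothing and UTF-8 can decode them
$Recovered = [System.Text.Encoding]::UTF8.GetString($Latin1.GetBytes($Garbled))

# True when the original text survived the round trip
$Recovered -ceq $Original
```

This only works because ISO-8859-1 is a one-to-one byte mapping; a mis-decoding through a lossy code page would not be reversible this way.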

But when you try to build a DOM from the HTML string, the [xml]$HtmlString cast fails if the HTML is not well-formed XML.

PS C:\tmp> [xml]$HtmlString
Cannot convert value "<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja">
(snip)
</body>
" to type "System.Xml.XmlDocument". Error: "'src' is an unexpected token. The expected token is '='. Line 38, position
15."
At line:1 char:1
+ [xml]$HtmlString
+ ~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvalidCastToXmlDocument
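
The message suggests an attribute written without a value, which HTML allows and XML does not. A minimal sketch with two made-up fragments (both strings are assumptions for illustration, not content from the page) shows the difference:

```powershell
# Well-formed as XML: every attribute has a quoted value and every tag is closed
$WellFormed = '<html><body><p id="a">ok</p></body></html>'
# Legal HTML, but not XML: the "async" attribute has no value
$NotXml = '<html><body><script async src="x.js"></script></body></html>'

# The cast succeeds for the well-formed fragment
$Doc  = [xml]$WellFormed
$Text = $Doc.SelectSingleNode('//p').InnerText

# The cast throws for the HTML-only fragment
try {
    $null = [xml]$NotXml
    $CastFailed = $false
} catch {
    $CastFailed = $true
    $Message = $_.Exception.Message
}
```

After running it, $Text holds the paragraph text, $CastFailed is $true, and $Message carries the parser's complaint.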

On the other hand, HtmlWebResponseObject#ParsedHtml can parse HTML even when the content is not well-formed XML, but there is no way to pass it a string.

Question

Is there any way to parse an HTML string held in a variable into a DOM with PowerShell, even when the HTML is not valid XML? The out-of-the-box features of PowerShell are preferable.

Edit

The out-of-the-box features of PowerShell are preferable because this work has to be done in a restricted VDI environment, where installing additional software requires permission.

iRon · Accepted Answer · 2021-06-21 09:20:54Z

1

Although HTML and XML look alike (both descend from SGML), HTML is incompatible with XML in many ways. Therefore, in most cases you can't use an XML parser to read it. Instead, you need an HTML parser, such as the one behind the IHTMLDocument2 interface, to manipulate the contained elements.
As an example:

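$Uri = 'https://kikakurui.com/x0/X0001-1994-01.html'
# Download the page as a string
$String = [System.Net.WebClient]::New().DownloadString($Uri)
# Encode the markup as UTF-16 bytes for the write call
$Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
# Create an empty HTML document through COM
$Document = New-Object -Com 'HTMLFile'
# Which write method is exposed depends on the machine, so probe for it
if ($Document.IHTMLDocument2_Write) {
    $Document.IHTMLDocument2_Write($Unicode)
} else {
    $Document.write($Unicode)
}
# Query the parsed DOM
$Document.getElementById('page1-div').getElementsByClassName("ft01")[0].innerText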

Yields:

本工業規格          JIS
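
If several strings need parsing, the same calls can be wrapped in a function that takes the HTML as a string, for instance one that has already been re-decoded from an Invoke-WebRequest response. This is only a sketch: ConvertTo-HtmlDocument is a hypothetical name, and it assumes a Windows machine where the HTMLFile COM object is registered.

```powershell
function ConvertTo-HtmlDocument {
    param(
        [Parameter(Mandatory)]
        [string]$Html
    )
    # Assumption: Windows with the HTMLFile COM object available
    $Document = New-Object -ComObject 'HTMLFile'
    $Unicode  = [System.Text.Encoding]::Unicode.GetBytes($Html)
    # Same probe as in the example above: use whichever write method exists
    if ($Document.IHTMLDocument2_Write) {
        $Document.IHTMLDocument2_Write($Unicode)
    } else {
        $Document.write($Unicode)
    }
    $Document
}

# Usage (not run here; the fragment is made up):
# $Dom = ConvertTo-HtmlDocument -Html '<p id="x">hello</p>'
# $Dom.getElementById('x').innerText
```

Taking a string rather than a URI keeps downloading and decoding separate from parsing, so any encoding fix can be applied before the DOM is built.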

edited Jun 21, 2021 at 9:20

answered Jun 21, 2021 at 7:25

iRon


3 Comments

Guenther Schmitz Over a year ago

Is there any other solution? On Server Core / Hyper-V Server there is no such thing as WebClient (which AFAIK needs Internet Explorer to work), and the COM object also does not exist on my test machine.

iRon Over a year ago

Apart from "The out-of-the-box features of PowerShell are preferable.", there is no background on this requirement in the question. Anyway, I guess the best way forward is to install some packages, such as PowerShell Core (including .NET Core 3.1) and a third-party package such as HtmlAgilityPack. Otherwise, you will probably end up writing your own HTML parser...

SATO Yusuke Over a year ago

@iRon Thank you for your answer. Your result seems to drop the first character (it should be 日本工業規格). Also, in my environment [System.Net.Webclient]::New().DownloadString() failed to detect the text encoding ($String held a garbled string), so I had to use

$RawContent = Invoke-WebRequest -Method Get -Uri $Uri
$String = [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::GetEncoding("ISO-8859-1").GetBytes($RawContent.Content))

instead. But your idea (use HTMLFile#write()) is exactly what I'm looking for. Thanks!
