
How would you load a web page in PowerShell that requires JavaScript? On a page that requires JavaScript, Invoke-WebRequest returns an error such as "Please enable JS and disable any ad blocker".

For example, this returns that error:

$WebResponse = Invoke-WebRequest -UseBasicParsing "https://www.microcenter.com/"

I also tried the PSParseHTML module, which returns the same error:

$htmlDom = ConvertFrom-HtmlTable -Engine AngleSharp -Url "https://www.microcenter.com/"

In this case, the website returns the error because it is trying to stop you from using an ad blocker. However, the real need here is to fully load a web page, because most web pages these days don't return what you're looking for unless their scripts are run.

Thanks, Brian


4 Answers

  • Web scraping via Invoke-WebRequest / Invoke-RestMethod works only with static content in the target page (i.e. with the raw HTML source code).[1]

  • To support extracting content that gets loaded dynamically, via JavaScript, you need a full web browser that you can control programmatically.

  • As you've discovered yourself, Chromium-based browsers do offer a CLI method of outputting the dynamically generated / augmented HTML, as it would render interactively in a browser, using the --headless and --dump-dom options.

  • You can capture this HTML in a variable and then process it with an HTML parser, such as the one provided by the AngleSharp .NET library, as offered via the PSParseHTML module, which also wraps the HTML Agility Pack (which is used by default).


The following is self-contained sample code:

  • It assumes that you're running Windows 11 with the modern, Chromium-based version of Microsoft Edge, located at:

    "C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
    
    • Alternatively, you can download a different Chromium-based browser, such as Brave, or Google Chrome, whose executable you can then find at:

      "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
      
  • It downloads HTML from sample URL http://www.nptcstudents.co.uk/andrewg/jsweb/dynamicpages.html, which dynamically fills in various elements using client-side JavaScript, including one with the current timestamp.

  • It then ensures that the PSParseHTML module is installed and uses it to parse the rendered HTML, and extracts the element that was dynamically populated with the current timestamp to verify that client-side rendering was indeed performed.

# Create an 'msedge' alias for Microsoft Edge.
Set-Alias msedge 'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe'

# Sample URL that includes dynamic content.
$url = 'http://www.nptcstudents.co.uk/andrewg/jsweb/dynamicpages.html'

# Use Microsoft Edge in headless mode to download from the URL
# and run its client-side scripts.
# Note:
#  * --disable-gpu prevents any GPU-related errors from appearing in the output.
#  * ... | Out-String captures all output as a *single, multiline string*
#    and additionally ensures *synchronous* execution on Windows,
#    which in turn enables capturing the output.
#  * Since a full web browser must be launched, as a child process,
#    followed by downloading and rendering a web page, this takes
#    a while, especially if the browser isn't already running.
Write-Verbose -Verbose "Downloading and rendering $url..."
$dynamicHtml =
  msedge --headless --dump-dom --disable-gpu $url | Out-String

# Now you can use the PSParseHTML module to parse the captured HTML.
# Install the module on demand.
if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
  Write-Verbose -Verbose "Installing PSParseHTML module for the current user..."
  Install-Module -ErrorAction Stop -Scope CurrentUser PSParseHTML
}

# Parse the HTML.
Write-Verbose -Verbose "Parsing the rendered HTML..."
$parsedHtml = ConvertFrom-Html -Engine AngleSharp -Content $dynamicHtml

# Now extract the dynamically populated element to verify that it contains the current timestamp.
Write-Verbose -Verbose "Extracting a dynamically populated element..."
$parsedHtml.QuerySelectorAll('div.exampleblock')[1].InnerHtml

The above should print something like (note the timestamp):


<script type="text/javascript">
    document.write("The date is " + Date());
</script>The date is Tue Jan 16 2024 23:48:55 GMT-0500 (Eastern Standard Time)
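
The sample above hard-codes the Edge path. As a minimal sketch, the same steps can be wrapped in two small functions that pick whichever Chromium-based browser is installed and return the rendered HTML as a string. The function names Find-ChromiumBrowser and Get-RenderedHtml are hypothetical, and the third candidate path (Chrome under "C:\Program Files") is an assumption that does not appear in the answer above; adjust the list to match your system.

```powershell
# Hypothetical helper: returns the first candidate path that exists on disk.
function Find-ChromiumBrowser {
  param(
    [string[]] $CandidatePaths = @(
      # The first two paths come from the answer above; the third is an
      # assumed location for 64-bit Chrome installations.
      'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe',
      'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe',
      'C:\Program Files\Google\Chrome\Application\chrome.exe'
    )
  )
  foreach ($path in $CandidatePaths) {
    if (Test-Path -LiteralPath $path -PathType Leaf) { return $path }
  }
  throw "No Chromium-based browser found in: $($CandidatePaths -join ', ')"
}

# Hypothetical helper: returns the rendered DOM of $Url as a single string.
function Get-RenderedHtml {
  param(
    [Parameter(Mandatory)] [string] $Url,
    [string] $BrowserPath = (Find-ChromiumBrowser)
  )
  # Same options as in the sample above; piping to Out-String collects all
  # output lines into one multiline string.
  & $BrowserPath --headless --dump-dom --disable-gpu $Url | Out-String
}
```

With these defined, $dynamicHtml = Get-RenderedHtml -Url 'http://www.nptcstudents.co.uk/andrewg/jsweb/dynamicpages.html' should yield the same string as the sample above, which can then be passed to ConvertFrom-Html.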
            

[1] In the legacy, Windows-only, ships-with-Windows Windows PowerShell edition, Invoke-WebRequest by default (unless -UseBasicParsing is passed) does return dynamically generated HTML, by using the obsolete Internet Explorer engine behind the scenes - see this answer for an example.
In PowerShell (Core) 7+, the modern, cross-platform, install-on-demand edition, -UseBasicParsing is invariably implied, meaning that the raw HTML source code is only ever downloaded.
However, as of Windows 11, you can still emulate the Windows PowerShell behavior via the InternetExplorer.Application COM object; here's a minimal example:
$ie = New-Object -ComObject InternetExplorer.Application; $ie.Navigate2('https://example.org'); while ($ie.Busy) { Start-Sleep -Milliseconds 200 }; $ie.Document.getElementsByTagName('p') | ForEach-Object outerText
Either way, the obsolete status of Internet Explorer makes such solutions increasingly unusable, so PowerShell-external solutions for loading dynamic HTML and parsing it are needed.


3 Comments

Thank you, this works great! When I use this method with Edge it stores the stdout to a variable that I can inspect. Now I've discovered the webpage I was looking at is detecting the headless browser and changing the output, but that is a different issue.
Glad to hear it, @gpburdell. Interesting that the target site / its client-side JavaScript would detect the headless operation - do you know how that is done, and of a way around that?
I believe reCAPTCHA is detecting it, and my guess is that's on purpose, as they don't want people scraping the website. I don't know a way around that, and I would guess it would be a constant battle if you found one: reCAPTCHA makes a living trying to make sure a real user is interacting with the browser.

This likely isn't what you want to hear, but you might have to switch languages, or create an executable that returns what you want to PowerShell. Consider using NodeJS to load and parse the complete webpage, extract the information, and return it to PowerShell. Check out this answer for more info: https://stackoverflow.com/a/44005813/23226464

12 Comments

Thanks, I don't mind learning a new language if I need to. I've done a ton of C and C# programming in the distant past and can pick up new languages quickly if I need to. I just happen to be more familiar with PowerShell these days, so I would prefer to be able to call something from PowerShell that will grab and return HTML. It sounds like NodeJS and Chrome may be the easiest way to do this?
@gpburdell Yeah, unfortunately. It does make sense, though, because JavaScript is the language of the web, so "advanced web stuff" works best in NodeJS.
I may have found a way to return HTML but can't quite get it into a variable. Chrome has a command to dump the HTML, but it seems to run asynchronously, so my variable is empty. Any idea how to capture what goes to stdout asynchronously? $chromeOut = chrome --headless --disable-gpu --dump-dom google.com
@gpburdell Nvm, changed it to a real url (https://www.google.com). Take a look at this and see if it works. I'm not a PS expert. stackoverflow.com/a/24371479/23226464
Thank you, that may work, I didn't know you could register an event handler in PS but I'm just a PS layman. I'll give this a try and see if I can get it to work, thanks!

I was not able to extract a table from this site: https://www.cropex.hr/en/market-data/day-ahead-market/day-ahead-market-results.html

I have customized the script from above in the following way:

$outputFile = 'D:\cropex\result.html'
$page = $parsedHtml.QuerySelectorAll('table')[1].OuterHtml
$page | Set-Content -Path $outputFile

Any thoughts?

1 Comment

If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From Review

Try this:

# Path to chrome.exe    
$Chrome = "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
$Opt = '--headless=new --dump-dom --virtual-time-budget=10000 https://www.yoursite.com'
    
Start-Process -FilePath $Chrome -ArgumentList $Opt -WindowStyle Hidden -PassThru -Wait

The resulting page source is written to the PowerShell console.
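
Start-Process does not return the launched program's stdout to the pipeline, so the HTML cannot be assigned to a variable directly. If you want it in a variable, one option is to redirect stdout to a temporary file and read that file back after the process exits. The following is a rough sketch of that idea; the function name Invoke-CapturedProcess is hypothetical, and reading the file as UTF-8 is an assumption about the browser's output encoding.

```powershell
# Hypothetical helper: runs a program, waits for it to exit, and returns
# everything it wrote to stdout as a single string.
function Invoke-CapturedProcess {
  param(
    [Parameter(Mandatory)] [string] $FilePath,
    [string[]] $ArgumentList = @()
  )
  $stdoutFile = [System.IO.Path]::GetTempFileName()
  try {
    $params = @{
      FilePath               = $FilePath
      RedirectStandardOutput = $stdoutFile
      Wait                   = $true
    }
    # Start-Process rejects an empty -ArgumentList, so pass it only if needed.
    if ($ArgumentList.Count -gt 0) { $params.ArgumentList = $ArgumentList }
    Start-Process @params
    # Assumption: the output is UTF-8 encoded.
    Get-Content -LiteralPath $stdoutFile -Raw -Encoding UTF8
  }
  finally {
    # Clean up the temporary file even if the launch failed.
    Remove-Item -LiteralPath $stdoutFile -ErrorAction Ignore
  }
}
```

For example, $html = Invoke-CapturedProcess -FilePath $Chrome -ArgumentList '--headless=new', '--dump-dom', '--virtual-time-budget=10000', 'https://www.yoursite.com' would reuse the options shown above and leave the page source in $html.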
