Why does my PowerShell script not run as expected?

Question

I have created a script to crawl the IMDB website. The script takes a list of IMDB URLs, extracts data such as the movie title, release year, and plot summary, and exports it to a text file in CSV format. The script is below.

$listToCrawl =  "imdb_link_list.txt"
$pathOfFile = "K:\MY DOCUMENTS\POWERSHELL\IMDB FILE\"
$fileName = "plot_summary.txt"
New-Item ($pathOfFile + $fileName) -ItemType File
Set-Content ($pathOfFile + $fileName) '"Title","Year","URL","Plot Summary"'

Get-Content ($pathOfFile + $listToCrawl) | ForEach-Object {
 $url = $_
$Result =  Invoke-WebRequest -Uri $url

$movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1"
$movieTitleNode = $Result.ParsedHtml.querySelector( $movieTitleSelector)
$movieTitle = $movieTitleNode.innerText

$movieYearSelector = "#titleYear"
$movieYearNode = $Result.ParsedHtml.querySelector($movieYearSelector)
$movieYear = $movieYearNode.innerText

$plotSummarySelector = "#titleStoryLine > div:nth-child(3) > p > span"
$plotSummaryNode = $Result.ParsedHtml.querySelector($plotSummarySelector)
$plotSummary = $plotSummaryNode.innerText
$movieDataEntry = '"' + $movieTitle + '","' + $movieYear + '","' + $url + '","' + $plotSummary + '"'
Add-Content ($pathOfFile + $fileName) $movieDataEntry
}

The list of URLs to crawl is saved in the "K:\MY DOCUMENTS\POWERSHELL\IMDB FILE\imdb_link_list.txt" file, and its content is as follows.

https://www.imdb.com/title/tt0472033/
https://www.imdb.com/title/tt0478087/
https://www.imdb.com/title/tt0285331/
https://www.imdb.com/title/tt0453562/
https://www.imdb.com/title/tt0120577/
https://www.imdb.com/title/tt0416449/

I ran the script, but it does not work as expected. The following error is thrown.

Invalid argument.
At K:\MY DOCUMENTS\POWERSHELL\IMDB_Plot_Summar_ Extract.ps1:20 char:1
+ $plotSummaryNode = $Result.ParsedHtml.querySelector($plotSummarySelec ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

I think the problem is the CSS selector I use to select the plot summary, but I don't know what's wrong with it. As far as I can tell, I have followed the CSS selector rules.

$plotSummarySelector = "#titleStoryLine > div:nth-child(3) > p > span"

Does anyone know what's wrong with it?

Can't say anything without seeing the HTML structure, I guess.
– Kevin, Commented Mar 3, 2020 at 5:48
You can check it yourself; I have provided the links to the pages above. Just open one in Google Chrome and view the source.
– MaydayUniversal, Commented Mar 3, 2020 at 5:55
Why are you using div:nth-child(3)? The div has a class you can use!
– Kevin, Commented Mar 3, 2020 at 6:00
Well, that is what I get when I inspect the element in Google Chrome and choose Copy selector. What is your suggestion?
– MaydayUniversal, Commented Mar 3, 2020 at 6:53
My suggestion is to use the div's class instead of :nth-child(3).
– Kevin, Commented Mar 3, 2020 at 7:07

mclayton · Accepted Answer · 2020-03-03 09:57:16Z

The ParsedHtml property is specific to Windows PowerShell and doesn't exist in PowerShell Core, so if you want to future-proof your code you're better off using something like the HTML Agility Pack:

# install the HTML Agility Pack NuGet package
Invoke-WebRequest -Uri "https://dist.nuget.org/win-x86-commandline/latest/nuget.exe" -OutFile ".\nuget.exe"
.\nuget.exe install "HtmlAgilityPack" -Version "1.11.21"

# import the HTML Agility Pack
Add-Type -Path ".\HtmlAgilityPack.1.11.21\lib\Net40\HtmlAgilityPack.dll"

# get the web page content and load it into an HtmlDocument
$response = Invoke-WebRequest -Uri "https://www.imdb.com/title/tt0472033/" -UseBasicParsing
$html = $response.Content
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($html)

Then you can extract nodes using XPath syntax, e.g. for the title:

# extract the title
$titleHtml = $doc.DocumentNode.SelectSingleNode("//div[@class='title_wrapper']/h1/text()[1]").InnerText
$titleText = [System.Net.WebUtility]::HtmlDecode($titleHtml).Trim()
Write-Host "'$titleText'" # '9'

I'll leave the rest of the document elements as an exercise for the reader :-).
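For readers who want a starting point, here is a rough sketch of how the remaining fields and the CSV export might be wired together. It assumes the HtmlAgilityPack assembly has already been loaded with Add-Type as above. The function names (Get-NodeText, New-MovieRecord, Get-MovieRecord) are made up for illustration, and the XPath expressions for the year and plot summary are only guesses derived from the ids in the question's CSS selectors (#titleYear, #titleStoryLine), so they may need adjusting against the real page. Building objects and piping them to Export-Csv, rather than concatenating strings, lets PowerShell handle the quoting when a title or plot contains a double quote.

```powershell
# Sketch only: XPath expressions below are assumptions based on the selectors
# in the question and may not match the live IMDB markup.

# Return the decoded, trimmed text of the first node matching $XPath,
# or an empty string when nothing matches (SelectSingleNode returns $null).
function Get-NodeText {
    param($Document, [string]$XPath)
    $node = $Document.DocumentNode.SelectSingleNode($XPath)
    if ($null -eq $node) { return '' }
    return [System.Net.WebUtility]::HtmlDecode($node.InnerText).Trim()
}

# Build one output row; the property order becomes the CSV column order.
function New-MovieRecord {
    param([string]$Title, [string]$Year, [string]$Url, [string]$PlotSummary)
    [pscustomobject]@{
        Title          = $Title
        Year           = $Year
        URL            = $Url
        'Plot Summary' = $PlotSummary
    }
}

# Download one page and turn it into a record.
# Requires the HtmlAgilityPack assembly to be loaded beforehand.
function Get-MovieRecord {
    param([string]$Url)
    $response = Invoke-WebRequest -Uri $Url -UseBasicParsing
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.LoadHtml($response.Content)
    New-MovieRecord -Url $Url `
        -Title (Get-NodeText $doc "//div[@class='title_wrapper']/h1/text()[1]") `
        -Year (Get-NodeText $doc "//*[@id='titleYear']") `
        -PlotSummary (Get-NodeText $doc "//*[@id='titleStoryLine']//p/span")
}
```

With those functions defined, the loop from the question could be reduced to something like: Get-Content on the link list, piped to ForEach-Object { Get-MovieRecord -Url $_ }, piped to Export-Csv -Path with the output file and -NoTypeInformation. A field that comes back empty most likely means the guessed XPath did not match.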
