I have created a script to crawl the IMDb website. The script takes a list of IMDb URLs, extracts data such as the movie title, release year, and plot summary from each page, and exports it to a text file in CSV format. The script is shown below.

$listToCrawl =  "imdb_link_list.txt"
$pathOfFile = "K:\MY DOCUMENTS\POWERSHELL\IMDB FILE\"
$fileName = "plot_summary.txt"
New-Item ($pathOfFile + $fileName) -ItemType File
Set-Content ($pathOfFile + $fileName) '"Title","Year","URL","Plot Summary"'

Get-Content ($pathOfFile + $listToCrawl) | ForEach-Object {
    $url = $_
    $Result = Invoke-WebRequest -Uri $url

    $movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1"
    $movieTitleNode = $Result.ParsedHtml.querySelector($movieTitleSelector)
    $movieTitle = $movieTitleNode.innerText

    $movieYearSelector = "#titleYear"
    $movieYearNode = $Result.ParsedHtml.querySelector($movieYearSelector)
    $movieYear = $movieYearNode.innerText

    $plotSummarySelector = "#titleStoryLine > div:nth-child(3) > p > span"
    $plotSummaryNode = $Result.ParsedHtml.querySelector($plotSummarySelector)
    $plotSummary = $plotSummaryNode.innerText
    $movieDataEntry = '"' + $movieTitle + '","' + $movieYear + '","' + $url + '","' + $plotSummary + '"'
    Add-Content ($pathOfFile + $fileName) $movieDataEntry
}

The list of URLs to crawl is saved in the file "K:\MY DOCUMENTS\POWERSHELL\IMDB FILE\imdb_link_list.txt", and its contents are as follows.

https://www.imdb.com/title/tt0472033/
https://www.imdb.com/title/tt0478087/
https://www.imdb.com/title/tt0285331/
https://www.imdb.com/title/tt0453562/
https://www.imdb.com/title/tt0120577/
https://www.imdb.com/title/tt0416449/

When I run the script, it does not work as expected. The following error is thrown.

Invalid argument.
At K:\MY DOCUMENTS\POWERSHELL\IMDB_Plot_Summar_ Extract.ps1:20 char:1
+ $plotSummaryNode = $Result.ParsedHtml.querySelector($plotSummarySelec ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

I think the problem is caused by the CSS selector I use to select the plot summary, but I don't know what is wrong with it. As far as I can tell, it follows the CSS selector syntax.

$plotSummarySelector = "#titleStoryLine > div:nth-child(3) > p > span"

Does anyone know what is wrong with this selector?

  • I can't say anything without seeing the HTML structure, I guess. Commented Mar 3, 2020 at 5:48
  • You can check it yourself; I have provided the links to the pages above. Just open one in Google Chrome and view the source. Commented Mar 3, 2020 at 5:55
  • Why are you using div:nth-child(3)? The div has a class you can use! Commented Mar 3, 2020 at 6:00
  • Well, that is what I get when I inspect the element in Google Chrome and choose Copy selector. What is your suggestion? Commented Mar 3, 2020 at 6:53
  • My suggestion is to use the div's class instead of :nth-child(3). Commented Mar 3, 2020 at 7:07

1 Answer

The ParsedHtml property is specific to Windows PowerShell and doesn't exist in PowerShell Core, so if you want to future-proof your code, you're better off using something like the HTML Agility Pack.

# install the HTML Agility Pack NuGet package
Invoke-WebRequest -Uri "https://dist.nuget.org/win-x86-commandline/latest/nuget.exe" -OutFile ".\nuget.exe"
.\nuget.exe install "HtmlAgilityPack" -Version "1.11.21"

# import the HTML Agility Pack
Add-Type -Path ".\HtmlAgilityPack.1.11.21\lib\Net40\HtmlAgilityPack.dll"

# get the web page content and load it into an HtmlDocument
$response = Invoke-WebRequest -Uri "https://www.imdb.com/title/tt0472033/" -UseBasicParsing
$html = $response.Content
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($html)

Then you can extract nodes using XPath syntax, e.g. for the title:

# extract the title
$titleHtml = $doc.DocumentNode.SelectSingleNode("//div[@class='title_wrapper']/h1/text()[1]").InnerText
$titleText = [System.Net.WebUtility]::HtmlDecode($titleHtml).Trim()
Write-Host "'$titleText'" # '9'

I'll leave the rest of the document elements as an exercise for the reader :-).
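
As a possible starting point for that exercise, here is a minimal sketch that pulls the year and the plot summary with XPath and turns them into a CSV row. It runs against a small inline HTML sample rather than the live page: the `titleYear` and `titleStoryLine` ids and the "third child div" position come from the selectors in the question, while the rest of the sample markup, the `Get-NodeText` helper, and the variable names are assumptions made for illustration. The real page may be structured differently, so treat the XPath expressions as something to verify, not as a known-good recipe.

```powershell
# Import the HTML Agility Pack (same path as in the install step above).
Add-Type -Path ".\HtmlAgilityPack.1.11.21\lib\Net40\HtmlAgilityPack.dll"

# Hypothetical stand-in markup; only the two ids are taken from the question.
$sampleHtml = @'
<html><body>
<span id="titleYear">(<a href="/year/2009/">2009</a>)</span>
<div id="titleStoryLine">
  <h2>Storyline</h2>
  <span class="rightcornerlink"></span>
  <div class="inline canwrap"><p><span>A rag doll fights "machines", and wins.</span></p></div>
</div>
</body></html>
'@

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($sampleHtml)

# Hypothetical helper: decoded, trimmed text of the first matching node,
# or $null when the XPath matches nothing.
function Get-NodeText {
    param($Document, [string]$XPath)
    $node = $Document.DocumentNode.SelectSingleNode($XPath)
    if ($null -eq $node) { return $null }
    [System.Net.WebUtility]::HtmlDecode($node.InnerText).Trim()
}

# "#titleYear" from the question, expressed as XPath.
$year = Get-NodeText $doc "//*[@id='titleYear']"

# "#titleStoryLine > div:nth-child(3) > p > span" translated literally:
# the third element child of #titleStoryLine, provided it is a div.
$plot = Get-NodeText $doc "//*[@id='titleStoryLine']/*[3][self::div]/p/span"

# Let ConvertTo-Csv do the quoting instead of concatenating strings by hand.
$row = [pscustomobject]@{ Year = $year; 'Plot Summary' = $plot }
$csv = $row | ConvertTo-Csv -NoTypeInformation
$csv
```

Building the row as an object and passing it through `ConvertTo-Csv` (or `Export-Csv -Append` when writing straight to a file) means commas and double quotes inside a plot summary are escaped for you, which hand-built strings do not handle.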
