I'm trying to pull data from an HTML source to create a list of books and authors.

As each book has its own HTML page, I'm using regular expressions to extract the information I need.

Using the following code, I can successfully return the book title (e.g. 'My First Cook Book'):

>> $regexp = '<title>listing for - (?<title>.*) \[.*\]'
>> $name = ($url | select-string $regexp -allmatches).matches
>> $name.groups[1].value
My First Cook Book

However, I cannot retrieve the author using a similar method. I assume this is because the markup is spread across multiple lines, or because it includes non-textual characters.

>> $regex1 = '<td class="tboldc" width="170">&nbsp; Author:</td>
>> <td class="tnormg" width="*">&nbsp;(?<author>.*)</td>'

>> $name1 = ($url | select-string $regex1 -allmatches).matches
>> $name1.groups[1].value
Cannot index into a null array.
At line:1 char:1
+ $name1.groups[1].value     

I would like to retrieve the author's name (in this case 'D Atherton').

Where am I going wrong?

I've tried placing double quotes around the & characters ( "&" ) and moving my (?<author>.*) group to different positions in the pattern. That gives varying results, but only when the pattern covers a single line of the source. [I'm assuming I need both lines, so that the regex can locate the ' Author:' label on the first line and capture the desired value from the second.]

[Solved]

Thank you to all who suggested alternative ways of solving this one. I think I've finally solved it while sticking to PowerShell regex.

I replaced the $regex1 line with

$regex1 = '(?s) Author:<\/td>(?<author>.*?)<\/td' 

I then used the following line to strip everything up to and including the first semicolon (the end of the leading &nbsp;), which leaves the author name:

$author = $name1.groups[1].value -creplace '^[^\;]*\;', '' 

Phew!
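
For anyone landing here later, here is a minimal sketch of the same idea done in a single pattern. It assumes the page markup looks like the two lines quoted in the question. The here-string $html is a made-up stand-in for the real contents of $url, and $pattern is a variation for illustration, not the exact regex above.

```powershell
# Hypothetical sample markup, modelled on the two <td> lines in the question.
# In the real script this text would come from $url (Invoke-RestMethod).
$html = @'
<title>listing for - My First Cook Book [12345]</title>
<table><tr>
<td class="tboldc" width="170">&nbsp; Author:</td>
<td class="tnormg" width="*">&nbsp;D Atherton</td>
</tr></table>
'@

# (?s) lets . match newlines, so the pattern can run from the label cell
# into the next cell. The leading &nbsp; and whitespace are skipped
# before the named group starts capturing.
$pattern = '(?s)Author:</td>.*?<td[^>]*>(?:&nbsp;|\s)*(?<author>.*?)\s*</td>'

$author = $null
if ($html -match $pattern) {
    # -match on a single string fills the automatic $Matches table
    $author = $Matches['author']
}

$author   # D Atherton
```

With a pattern like this the follow-up -creplace should not be needed, because the leading &nbsp; is consumed before the capture group. It is also worth noting that in the original $regex1 the unescaped * in width="*" is read as a quantifier and not as a literal asterisk, which on its own would stop that pattern from matching.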

  • Why would you use regex instead of a dom parser? Commented Mar 25, 2023 at 16:57
  • See e.g.: Powershell regex multiple match per line Commented Mar 25, 2023 at 17:46
  • You're not showing how $url is populated. Unless it is a multi-line string, you won't be able to match across lines. Commented Mar 25, 2023 at 21:35
  • $url = Invoke-RestMethod -uri "samplewebpage.com/bookid" the other regex lines work ok, but this one has me stumped! Commented Mar 25, 2023 at 22:31
  • Regex is for Regular Expression and HTML is not regular. When you get nested HTML data Regex will not work due to the recursions. Use a HTML parser library instead. Commented Mar 26, 2023 at 8:50
