I am trying to extract values from HTML stored in a text file. Here's what the file looks like:

<html>

<cool class="what"> idk1 </cool>
<lame id="hm">idc1 </lame>
<lame id="hm"> idc2 </lame>
<lame id="hm"> idc3</lame>
<lame id="hm"> idc4 </lame>
<confused id="allTheTime"> abc1 </confused>

<cool class="what"> idk2 </cool>
<lame id="hm"> </lame>
<lame id="hm"> idc2 </lame>
<lame id="hm">  </lame>
<lame id="hm"> idc4 </lame>
<confused id="allTheTime"> abc2 </confused>

<cool class="what"> idk3 </cool>
<confused id="allTheTime"> abc3 </confused>

</html>

Below is my code:

$html = Get-Content -path 'C:\Users\bob\Desktop\tester.txt' -Raw
$wantedData1 = ($html | select-string '(?<=<cool class="what">\s+)(.*?)(?=\s+</cool>)' -allMatches | foreach {$_.Matches} | Foreach {$_.Value})
$wantedData2 = ($html | select-string '(?<=<lame id="hm">\s+)(.*?)(?=\s+</lame>)' -allMatches | foreach {$_.Matches} | Foreach {$_.Value})
$wantedData3 = ($html | select-string '(?<=<confused id="allTheTime">\s+)(.*?)(?=\s+</confused>)' -allMatches | foreach {$_.Matches} | Foreach {$_.Value})
write-host $wantedData1
write-host $wantedData2
write-host $wantedData3

The output looks like this:

idk1 idk2 idk3
idc2 idc4 idc2  idc4
abc1 abc2 abc3

I am trying to write something that gives me output like this:

idk1
idc1
idc2
idc3
idc4
abc1

idk2
idc2
idc4
abc2

idk3
abc3

The <cool> and <confused> tags each occur exactly once per group, but the <lame> tag may be absent, or there may be between 1 and 5 <lame> tags. I mention this because one of my other queries would break if the tag was null. Any help would be greatly appreciated. Thanks.

1 Answer

It looks like your HTML text is also valid XML, which makes parsing easier:

# Simulate reading from an XML file.
# To read from an actual file, use:
#    ($xmlDoc = [xml]::new()).Load((Convert-Path file.xml))
[xml] $xmlDoc = @'
<html>

<cool class="what"> idk1 </cool>
<lame id="hm">idc1 </lame>
<lame id="hm"> idc2 </lame>
<lame id="hm"> idc3</lame>
<lame id="hm"> idc4 </lame>
<confused id="allTheTime"> abc1 </confused>

<cool class="what"> idk2 </cool>
<lame id="hm"> </lame>
<lame id="hm"> idc2 </lame>
<lame id="hm">  </lame>
<lame id="hm"> idc4 </lame>
<confused id="allTheTime"> abc2 </confused>

<cool class="what"> idk3 </cool>
<confused id="allTheTime"> abc3 </confused>

</html>
'@

# Get the inner text from all <html> child nodes and
# trim surrounding whitespace from each.
$xmlDoc.html.ChildNodes.InnerText.Trim()

The above yields the following (which doesn't fully match what you state in your question, but which I presume is what you meant; the empty lines come from the whitespace-only <lame> elements); you can capture it in an array of strings simply by prepending something like $strings = to the command:

idk1
idc1
idc2
idc3
idc4
abc1
idk2

idc2

idc4
abc2
idk3
abc3
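
If you do want the grouped layout from the question (empty values dropped, a blank line between groups), one option is to loop over the child nodes yourself. The following is only a minimal sketch: it assumes every group starts with a <cool> element, as in the sample, and the variable names $isFirstGroup and $lines are purely illustrative.

```powershell
# Sample input copied from the question.
[xml] $xmlDoc = @'
<html>
<cool class="what"> idk1 </cool>
<lame id="hm">idc1 </lame>
<lame id="hm"> idc2 </lame>
<lame id="hm"> idc3</lame>
<lame id="hm"> idc4 </lame>
<confused id="allTheTime"> abc1 </confused>
<cool class="what"> idk2 </cool>
<lame id="hm"> </lame>
<lame id="hm"> idc2 </lame>
<lame id="hm">  </lame>
<lame id="hm"> idc4 </lame>
<confused id="allTheTime"> abc2 </confused>
<cool class="what"> idk3 </cool>
<confused id="allTheTime"> abc3 </confused>
</html>
'@

# Walk the child elements in document order.
# Assumption: a <cool> element marks the start of a new group.
$isFirstGroup = $true
$lines = foreach ($node in $xmlDoc.html.ChildNodes) {
    $text = $node.InnerText.Trim()
    if ($node.LocalName -eq 'cool') {
        # Emit a blank separator line before every group except the first.
        if (-not $isFirstGroup) { '' }
        $isFirstGroup = $false
    }
    # Skip elements whose text is empty after trimming.
    if ($text) { $text }
}

$lines
```

With the sample input this should print the three groups separated by single blank lines, and a group without any <lame> elements needs no special handling because the loop only emits what it finds.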