0

I'm using an Invoke-WebRequest to grab user information from a table on a website. The information is coming back as such:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table border='0' cellspacing='0' cellpadding='2' width='640'><tr><td class='col1bold' width='10'>&nbsp;</td><td class='col1reversebold' width='10'>&nbsp;</td><td class='col1reversebold' width='5'>&nbsp;</td><td class='col1reversebold' width='130'>F
ull Name</td><td class='col1reversebold' width='80'>Label</td><td class='col1reversebold' width='70'>Call Status</td><td class='col1reversebold' width='60'>DND/Hold</td><td class='col1reversebold' width='125'>Location</td><td class='col1reversebold'
 width='80'>Current Site</td></tr><tr><td class='col1bold' width='30'>&nbsp;</td><td class='col1'><img src=/console/images/static/B3000NW.gif></td><td class='col1'>&nbsp;</td><td class='col1'>Testing, Steve</td><td class='col1'>OR-72</td><td class='
col1'>Inactive</td><td class='col1'>&nbsp;</td><td class='col1'>703a0e88c252</td><td class='col1'>Global</td></tr><tr><td class='col1bold' width='30'>&nbsp;</td><td class='col1'><img src=/console/images/static/B3000B.gif></td><td class='col1'>&nbsp;
</td><td class='col1'>GreenTest, Connie</td><td class='col1'>OR-24</td><td class='col1'>Inactive</td><td class='col1'>&nbsp;</td><td class='col1'>Left Nurse Station</td><td class='col1'>Global</td></tr></table>

If I use the following syntax:

$GroupRequests = Invoke-WebRequest -Uri "$URI" -WebSession $session -Method GET

$HTMLTable = $GroupRequests.ParsedHtml.body.InnerHTML

I get:

<TABLE cellSpacing=0 cellPadding=2 width=640 border=0>
<TBODY>
<TR>
<TD class=col1bold width=10>&nbsp;</TD>
<TD class=col1reversebold width=10>&nbsp;</TD>
<TD class=col1reversebold width=5>&nbsp;</TD>
<TD class=col1reversebold width=130>Full Name</TD>
<TD class=col1reversebold width=80>Label</TD>
<TD class=col1reversebold width=70>Call Status</TD>
<TD class=col1reversebold width=60>DND/Hold</TD>
<TD class=col1reversebold width=125>Location</TD>
<TD class=col1reversebold width=80>Current Site</TD></TR>
<TR>
<TD class=col1bold width=30>&nbsp;</TD>
<TD class=col1><IMG src="/console/images/static/B3000NW.gif"></TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>Testing, Steve</TD>
<TD class=col1>OR-72</TD>
<TD class=col1>Inactive</TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>703a0e987f72</TD>
<TD class=col1>Global</TD></TR>
<TR>
<TD class=col1bold width=30>&nbsp;</TD>
<TD class=col1><IMG src="/console/images/static/B3000B.gif"></TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>GreenTest, Connie</TD>
<TD class=col1>OR-24</TD>
<TD class=col1>Inactive</TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>Left Nurse Station</TD>
<TD class=col1>Global</TD></TR></TBODY></TABLE>

How do I basically parse each row (<TR>) and then get the information into columns and headers? I've tried so many options with no luck.

2

1 Answer 1

1

A bit strange your ParsedHtml seems to have an XML declaration, but you could try the function below:

function ConvertFrom-HtmlTable {
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(Mandatory = $true, Position = 0)]
        [Microsoft.PowerShell.Commands.HtmlWebResponseObject]$WebRequest,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName,

        # if there are no <th> tags, by appending this switch, the code
        # uses the first row in the table as column headers.
        [switch]$FirstRowHasHeaders
   )

    # Extract the table out of the web request
    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $WebRequest.ParsedHtml.getElementByID($TableId) }
        'ByIndex' { $table = @($WebRequest.ParsedHtml.getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($WebRequest.ParsedHtml.getElementsByName($TableName))[0] }
        'ByClass' { $table = @($WebRequest.ParsedHtml.getElementsByClassName($TableClassName))[0] }
    }
    $WebRequest.Dispose()
    $WebRequest.Finalize()  # free the COM objects

    if (!$table) {
        Write-Warning "Could not find the given table."
        return $null
    }

    # load the System.Web assembly to be able to decode HTML entities
    Add-Type -AssemblyName System.Web

    $headers = @()
    # Go through all of the rows in the table
    foreach ($row in $table.Rows) {
        $cells = @($row.Cells)
        # If there is a table header or if we should consider the first row 
        # to be the header row, remember its titles
        if ($FirstRowHasHeaders -or $cells[0].tagName -eq "TH") {
            $i = 0
            $headers = @($cells | ForEach-Object {
                $i++
                # decode HTML entities and double-up quotes that the value may contain
                $th = ([System.Web.HttpUtility]::HtmlDecode($_.InnerText) -replace '"', '""').Trim()
                # if the table header is empty, create it
                if ([string]::IsNullOrEmpty($th)) { "H$i" } else { $th }
            })
            # we now have headers, set the switch to $false so the rest is parsed as data
            $FirstRowHasHeaders = $false
            # proceed with the next row
            continue
        }
        # if we haven't found any table headers, make up names "H1", "H2", etc.
        if(-not $headers) {
            $headers = @(1..($cells.Count + 2) | ForEach-Object { "H$_" })
        }

        # Now go through the cells in the the row. For each, try to find the
        # title that represents that column and create a hashtable mapping those
        # titles to content
        $hash = [Ordered]@{}
        for ($i = 0; $i -lt $cells.Count; $i++) {
            # decode HTML entities and double-up quotes that the value may contain
            $value = ([System.Web.HttpUtility]::HtmlDecode($cells[$i].InnerText) -replace '"', '""').Trim()
            $th = $headers[$i]
            $hash[$th] = $value.Trim()
        }
        # And finally cast that hashtable to a PSCustomObject
        [PSCustomObject]$hash
    }
}

and use it like this:

$GroupRequests = Invoke-WebRequest -Uri $URI -WebSession $session -Method GET

# the table doesn't have a name or id attribute, so use its index (guessing its the first table)
# in your example the table also does not have <th> header tags, so append -FirstRowHasHeaders
# if you want the first data row to act as column headers or accept that headers are created as
# 'H1', 'H2', 'H3' etc. Cells containing only whitespace will always be translated to 'H<columnNumber>'
$users = ConvertFrom-HtmlTable -WebRequest $GroupRequests -TableIndex 0 -FirstRowHasHeaders

# show on screen
$users | Format-Table -AutoSize

# save as CSV file
$users | Export-Csv -Path 'X:\Somewhere\users.csv' -NoTypeInformation
Sign up to request clarification or add additional context in comments.

4 Comments

This worked very nicely. Looks like the first row contains the headers rather than having a 'Header' row but other then that, it grabs what I need! Thanks!
@Joe Yes, that's because your HTML doesn't have <th> tags, only <td>..
Would there be a way to edit the function to make the first row the headers instead of H1, H2, etc....?
@Joe Please see my edit. I have added switch -FirstRowHasHeaders to the code and by using that you can force the function to treat the first data row as column headers if the table (like yours) does not have <th> tags

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.