1

With Powershell, I can get the table with this - $URL = "http://example.com/yyy.htm" $OutputFile = "$env:temp\tempfile.xml"

# reading website data:
$data = Invoke-WebRequest -Uri $URL 

# get the first table found on the website and write it to disk:
@($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | Set-Content -Path $OutputFile

Now I want this table to be converted to CSV... How do I do that?

Table example -

Datacenter | FirstDNS | SecondDNS | ThirdDNS | FourthDNS
-----------------------------------------------------------
NewYork    | 1.1.1.1  | 2.2.2.2   |3.3.3.3   | 4.4.4.4
India      | 1.2.3.4  | 3.2.6.5   |8.2.3.7   | 8.3.66.1
2
  • NB: Set-Content only creates a text file with the extension .xml; it's not necessarily a real xml file. I guess you're just writing to an xml file to do the conversion; for that you can simply do it by converting the string to xml; [xml]$myXml = @($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | out-string. NB: This will not be robust; if you're not using XHTML there's no guarentee that the HTML will obey XML's syntax rules (e.g. HTML5 has void elements, so there <hr> should not have a closing tag; whilst in XML it must be iether <hr /> or <hr></hr>. Commented Aug 11, 2015 at 19:10
  • Ref stackoverflow.com/questions/3558119/… Commented Aug 11, 2015 at 19:10

2 Answers 2

3

Here's a solution convert HTML tables to PSObjects, which you can then pipe to Export-CSV or do whatever you need to. Please note: this is not a clean solution; it just about does the job for simple scenarios, but has a lot of issues:

  • Can't cope with special characters (other than &nbsp;, to get it to work you'll need to add new definitions to the DocType's entity map as required)
  • Can't cope with colspan or rowspan; assumes that all tables have the same number of columns in every row as they had in the header (there's a tweak to prevent errors if there's more columns than headers; but you may still get misalignment in that scenario).
  • My technique for cleaning the HTML before converting to XML was to use a regex rather than a parsing library; so there could well be unexpected issues there.
function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader

    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name        
        } else {
            $colCount = ($cols | Measure-Object).Count - 1
            $result = new-object -TypeName PSObject
            0..$colCount | %{
                $colName = if($headers[$_]){$headers[$_]}else{("Column_{0:00000} -f $_")} #in case we have more columns than headers 
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            } 
            write-output $result
        }
    }
}

function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc

        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow

        #remove anything tags that aren't td or tr (simplifies our parsing of the data
        [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp '&#160;'>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'})) 
        [string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}

clear-host

[System.Uri]$url = 'https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions' 
$rqst = Invoke-WebRequest $url 
$rqst.ParsedHtml.getElementsByTagName('table') | ConvertFrom-HtmlTable 

FYI: I've also published an earlier version of this code on CodeReview, so check there to see if anyone suggests any good improvements.

Sign up to request clarification or add additional context in comments.

Comments

0

Use Tee-Object to output the file, and you can pipe to Export-CSV after that:

@($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML | Tee-Object -FilePath $OutputFile | Export-CSV $env:temp\tempfile.csv -notype

5 Comments

thanks for the response...but I only get a file that has a row called "Length" followed by "139670" below it...no data :(
how can I add a sample?
Ah, so that doesn't grab something that can be easily converted to CSV. Sorry, I don't have an HTML table to use to test with so I went with an assumption. Can you post a link to a table that could be used as example? Or maybe pastebin something?
my heartiest Apologies...its an internal site...so I converted a small portion to something that I can share as an attachment but cant find the place to attach. How may I send it to you. You can contact me at [email protected] is utterly important to me...
@Moderator- can you please help? How do I attach my file for use?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.