I'm a PowerShell and XPath beginner struggling to efficiently parse some XML and build an array of objects for further processing (e.g. CSV output, SQL Server load). A sample of the XML is included below, along with the code snippet I'm currently using. In this schema, each object-array represents a single row in the desired output. I'm parsing the MetaData children to get the column names (which become the PSObject properties), then building a collection of PSObjects where each object represents a single row.

This works fine for files with 10K rows or so, but bogs down horribly when run against my largest files with over 500K rows. In these cases each row is taking around 3-4 seconds to process. At 500K rows, that's a looong time to run. Is there some magic around XPath or PS variable assignment that I can use to speed this up?

The immediate need is to translate this XML into a CSV (currently performed via export-csv), but I'd prefer to have this portion of the script generate a collection of objects, as I'll next be looking to either load this data into a SQL Server instance or do other processing.

Thanks for the help!

David

Sample XML

<Report>
<Data>
<Columns>
<MetaData>
<Index>0</Index>
<Name>Column1</Name>
<Index>1</Index>
<Name>Column2</Name>
<Index>2</Index>
<Name>Column3</Name>
</MetaData>
</Columns>
<Rows>
<object-array>
<string>column1 value</string>
<int>column2 value</int>
<string>column3 value</string>
</object-array>
</Rows>
</Data>
</Report>

Sample Code

#extract the column headers
[string[]]$ColumnHeaders = @()
$obj.SelectNodes("/Report/Data/Columns/MetaData") |% {$ColumnHeaders += $_.name}

$collection = @()
$rowint = 0
$rowcount = $obj.Report.Data.Rows."object-array".count

#unwind the rows
do {
    $hash=@{}

    #loop through each element in the row parent element and add it to the hash
    $columnint = 0
    $columncount = (Select-Xml -xPath "Report/Data/Rows/object-array[$rowint]/node()" $obj).count
    do {
        $hash.Add($columnheaders[$columnint], (Select-Xml -xPath "Report/Data/Rows/object-array[$rowint]/descendant::text()[$columnint]" $obj).Node.Value)
        $columnint++
    } while ($columnint -lt $columncount)


    $thisrow = New-Object PSObject -Property $hash 

    #add this new row to the collection 
    $collection += $thisrow 
    $rowint++
} while ($rowint -lt $rowcount)

1 Answer

You can get the MetaData names directly, without re-creating ColumnHeaders in each iteration:

$ColumnHeaders = $obj.Report.Data.Columns.MetaData.Name

The same applies to $collection. What does the end result of your code look like?
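
To illustrate the point about $collection: using += on a PowerShell array creates a new array and copies every existing element on each call, so appending inside a loop gets slower as the array grows. A rough sketch of two common alternatives follows; the variable names and the 1..1000 range are only placeholders for illustration.

```powershell
# Slow pattern: each += allocates a new array and copies all existing elements.
$slow = @()
foreach ($i in 1..1000) { $slow += $i }

# Alternative 1: assign the loop's output to a variable so it is collected once.
$fast = foreach ($i in 1..1000) { $i }

# Alternative 2: a generic list that grows in place.
$list = New-Object 'System.Collections.Generic.List[int]'
foreach ($i in 1..1000) { $list.Add($i) }
```

All three variables end up holding the same 1000 values; only the cost of getting there differs.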

UPDATE: Give this a try

[xml]$obj = Get-Content test.xml

$data = $obj.Report.Data

$pso = New-Object PSObject
$pso | Add-Member NoteProperty -Name $data.Columns.MetaData.Name[0] -Value $data.Rows.'object-array'.string[0]
$pso | Add-Member NoteProperty -Name $data.Columns.MetaData.Name[1] -Value $data.Rows.'object-array'.int
$pso | Add-Member NoteProperty -Name $data.Columns.MetaData.Name[2] -Value $data.Rows.'object-array'.string[1] -PassThru
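
The snippet above builds a single object from hard-coded element names. As a hedged sketch of how the same idea might extend to every row, the version below reads the column names once, then walks each object-array a single time instead of re-running an XPath query from the document root for every cell, which is likely where most of the time goes in the original loop. The inline XML is a copy of the question's sample (with the int tag closed), and it assumes each row's child elements appear in the same order as the MetaData names.

```powershell
# Sample document mirroring the question's XML.
[xml]$obj = @'
<Report><Data>
  <Columns><MetaData>
    <Index>0</Index><Name>Column1</Name>
    <Index>1</Index><Name>Column2</Name>
    <Index>2</Index><Name>Column3</Name>
  </MetaData></Columns>
  <Rows><object-array>
    <string>column1 value</string>
    <int>column2 value</int>
    <string>column3 value</string>
  </object-array></Rows>
</Data></Report>
'@

# Read the column names once, up front.
$columnHeaders = @($obj.SelectNodes('/Report/Data/Columns/MetaData/Name') |
    ForEach-Object { $_.InnerText })

# Visit each row once; the foreach output is collected into $collection.
$collection = @(foreach ($row in $obj.SelectNodes('/Report/Data/Rows/object-array')) {
    $hash = @{}
    $i = 0
    # '*' selects only the row's child elements, in document order.
    foreach ($cell in $row.SelectNodes('*')) {
        $hash[$columnHeaders[$i]] = $cell.InnerText
        $i++
    }
    New-Object PSObject -Property $hash
})
```

$collection can then be piped to Export-Csv as before. A plain hashtable does not guarantee property order, so on PowerShell 3.0 or later, [ordered]@{} may be preferable if column order matters.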

8 Comments

$obj.Report.Data.Columns.MetaData.Name returns nothing, while $obj.Report.Data.Columns.MetaData |gm shows I'm getting back XMLElements, which have the Name property associated. The end result is an array $collection of objects that can then be piped to export-csv, ft, or other PS processing.
Can you include a sample output?
The output is a collection of PSObjects, having properties that correspond to the column headers and values that correspond to the row (object-array). In the sample XML, the result would be a single object with the following property/value pairs: Column1="Column1 Value", Column2="Column2 Value", Column3="Column3 Value". In the case of processing live data, there would be a PS array of between 10,000-500,000 of these objects, which could then be piped to export-csv, a dataset for SQL Server loading, or processed further straight in PS.
To clarify, each <object-array> group is the equivalent of a row in a spreadsheet. I'm trying to parse the MetaData group to get the proper column headings (only happens once at the start of the script) and then populate all the rows (PSObjects) with the data found in all of the object-array elements, of which there are several thousand object-array elements in a given XML file.
Updated my answer, give it a try