
I'm using a PowerShell script to append data to the end of a bunch of files. Each file is a CSV of around 50 MB (say 2 million-ish lines), and there are about 50 files.

The script I'm using looks like this:

$MyInvocation.MyCommand.Path

$files = ls *.csv

foreach ($f in $files)
{
    $basename = [System.IO.Path]::GetFileNameWithoutExtension($f)
    $year = $basename.Substring(0, 4)

    Write-Host "Starting" $basename

    $r = [IO.File]::OpenText($f)
    while ($r.Peek() -ge 0) {
        $line = $r.ReadLine()
        $line + "," + $year | Add-Content $(".\DR_" + $basename + ".CSV")
    }
    $r.Dispose()
}

Problem is, it's pretty slow. It's taken about 12 hours to get through them. It's not super complex, so I wouldn't expect it to take that long to run. What could I do to speed it up?

2 Answers


Reading and writing a file row by row can be a bit slow. Your antivirus may be contributing to the slowness as well. Use Measure-Command to see which parts of the script are the slow ones.
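For instance, a rough timing sketch along those lines might compare the cost of reading alone against reading plus the per-line Add-Content; the file names below are just placeholders:

$file = ".\2014_data.csv"   # placeholder for one of your CSVs

# Time just the read loop
$readTime = Measure-Command {
    $r = [IO.File]::OpenText($file)
    while ($r.Peek() -ge 0) { $null = $r.ReadLine() }
    $r.Dispose()
}

# Time the read plus a per-line Add-Content, as in the original script
$writeTime = Measure-Command {
    $r = [IO.File]::OpenText($file)
    while ($r.Peek() -ge 0) { $r.ReadLine() | Add-Content ".\timing_test.csv" }
    $r.Dispose()
}

"Read only: {0:N1}s   Read + Add-Content: {1:N1}s" -f $readTime.TotalSeconds, $writeTime.TotalSeconds

If the second number dwarfs the first, the per-line writes are the bottleneck.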

As general advice, write a few large blocks rather than lots of small ones. You can achieve this by collecting content in a StringBuilder and appending its contents to the output file every, say, 1,000 processed rows. Like so:

$sb = new-object Text.StringBuilder # New String Builder for stuff
$i = 1 # Row counter
while ($r.Peek() -ge 0) {
    # Add formatted stuff into the buffer
    [void]$sb.Append($("{0},{1}{2}" -f $r.ReadLine(), $year, [Environment]::NewLine ) )

    if(++$i % 1000 -eq 0){ # When 1000 rows are added, dump contents into file
      Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()
      $sb = new-object Text.StringBuilder # Reset the StringBuilder
    }
}
# Don't miss the tail of the contents
Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()

2 Comments

That is a significant speed boost. Thanks.
However, it does result in an extra newline every time. I've gotten around that (just for now) by not using the if block that dumps every 1,000 rows to the file, and instead writing it all out at the end.
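A minimal sketch of one way to keep the batching without the blank lines, assuming PowerShell 5+ (where Add-Content has a -NoNewline switch): the StringBuilder already ends every row with a newline, so Add-Content shouldn't add another.

if (++$i % 1000 -eq 0) { # When 1000 rows are added, dump contents into file
    # The buffer already ends with a newline after every row,
    # so -NoNewline stops Add-Content from appending a second one
    Add-Content -NoNewline -Path $(".\DR_" + $basename + ".CSV") -Value $sb.ToString()
    $sb = new-object Text.StringBuilder
}

# ... and likewise for the tail write after the loop
Add-Content -NoNewline -Path $(".\DR_" + $basename + ".CSV") -Value $sb.ToString()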

Don't drop into .NET Framework static methods and build up strings when there are cmdlets that can do the work on objects. Collect your data, add the year column, then export to your new file. You're also doing a ton of file I/O, and that will slow you down.

This will probably require a bit more memory, because it reads the whole file into memory at once. It also assumes that your CSV files have column headings. But it's much easier for someone else to look at and understand exactly what's going on (write your scripts so they can be read!).

# Always use full cmdlet names in scripts, not aliases
$files = Get-ChildItem *.csv;

foreach ($f in $files)
{
    # BaseName is a property of the file object in PowerShell; there's no need to call a static method
    $basename = $f.BaseName;
    $year = $f.BaseName.Substring(0, 4);

    # Every time you use Write-Host, a puppy dies
    "Starting $basename";

    # If you've got CSV data, treat it as CSV data. PowerShell can import it into a collection natively.
    $data = Import-Csv $f;

    foreach ($row in $data) {
        # Add a Year "property" to each row object
        $row | Add-Member -MemberType NoteProperty -Name "Year" -Value $year;
        # Export the modified row to the output file
        $row | Export-Csv -NoTypeInformation -Path $("r:\DR_" + $basename + ".CSV") -Append -NoClobber;
    }
}

2 Comments

Thanks, some of the comments were very informative. However, my original script looked a lot like this. The reason I changed it is that it was far too memory-hungry and far too slow. It's puzzling, actually: running this script seems to use over a GB of memory for only a 50 MB CSV file. Any idea why?
That amount of memory does seem excessive. I was essentially putting two copies of the file in memory, but it's still a ton of memory for that. I've now made an edit which will dump each row to disk serially, instead of collecting them all. It'll be a performance hit on the I/O, but memory usage should be less. You could also do a hybrid with the other answer, and collect X number of records at a time and then export those to file as a group.
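A rough sketch of that hybrid approach, reusing $f, $basename and $year from the loop in the answer above and a hypothetical batch size of 10,000 rows (this is not from either answer as written):

$batchSize = 10000   # hypothetical batch size; tune to taste
$outFile = ".\DR_" + $basename + ".CSV"
$batch = New-Object 'System.Collections.Generic.List[object]'

foreach ($row in (Import-Csv $f)) {
    $row | Add-Member -MemberType NoteProperty -Name "Year" -Value $year
    $batch.Add($row)

    # Flush a full batch to disk and start a new one
    if ($batch.Count -ge $batchSize) {
        $batch | Export-Csv -NoTypeInformation -Path $outFile -Append
        $batch.Clear()
    }
}

# Don't forget the final partial batch
if ($batch.Count -gt 0) {
    $batch | Export-Csv -NoTypeInformation -Path $outFile -Append
}

Each Export-Csv call still pays the cost of opening and closing the file, but only once per batch instead of once per row.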
