
I need to parse a large pipe-delimited file and count how many records have a 5th column that meets my criteria and how many don't.

PS C:\temp> gc .\items.txt -readcount 1000 | `
  ? { $_ -notlike "HEAD" } | `
  % { foreach ($s in $_) { $s.split("|")[4] } } | `
  group -property {$_ -ge 256} -noelement | `
  ft -autosize

This command does what I want, returning output like this:

  Count Name
  ----- ----
1129339 True
2013703 False

However, for a 500 MB test file, this command takes about 5.5 minutes to run as measured by Measure-Command. A typical file is over 2 GB, so waiting 20+ minutes is undesirably long.

Do you see a way to improve the performance of this command?

For example, is there a way to determine an optimum value for Get-Content's ReadCount? Without ReadCount, the same file takes 8.8 minutes.
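For what it's worth, one rough way I can think of to probe for a good ReadCount is to time a few candidate values with Measure-Command (the values below are just guesses):

foreach ($rc in 100, 500, 1000, 5000, 10000) {
    $t = Measure-Command {
        gc .\items.txt -readcount $rc |
          ? { $_ -notlike "HEAD" } |
          % { foreach ($s in $_) { $s.split("|")[4] } } |
          Out-Null
    }
    "ReadCount {0,6}: {1:n1} s" -f $rc, $t.TotalSeconds
}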

  • Have you tried StreamReader? I think that Get-Content loads the whole file into memory before it does anything with it. Commented Jan 17, 2012 at 21:52
  • You mean by importing System.IO? Commented Jan 17, 2012 at 21:59
  • Yeah, use the .NET Framework if you can. I used it to read large log files that SQL Server generates, with good results. I don't know any other way in PowerShell to read large files efficiently, but I'm no expert. Commented Jan 17, 2012 at 22:08
  • @Gisli, if you write your comment as an answer, I can upvote it and eventually accept it. Using StreamReader allowed me to get the time down to 1 minute for the test file. Commented Jan 17, 2012 at 22:40

3 Answers


Have you tried StreamReader? I think that Get-Content loads the whole file into memory before it does anything with it.

StreamReader class
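A minimal sketch of the pattern (the path is a placeholder; the try/finally just makes sure the reader gets disposed even if something throws):

$reader = New-Object System.IO.StreamReader('C:\temp\items.txt')
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        # process one line at a time; the whole file is never held in memory
    }
}
finally {
    $reader.Dispose()
}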




Using @Gisli's hint, here's the script I ended up with:

param($file = $(Read-Host -prompt "File"))
$fullName = (Get-Item "$file").FullName

# Stream the file one line at a time instead of piping it through Get-Content
$sr = New-Object System.IO.StreamReader("$fullName")
$trueCount = 0
$falseCount = 0
while (($line = $sr.ReadLine()) -ne $null) {
    if ($line -like 'HEAD|') { continue }    # skip the 'HEAD|' marker line
    if ($line.Split("|")[4] -ge 256) {       # 5th pipe-delimited column
        $trueCount++
    }
    else {
        $falseCount++
    }
}
$sr.Dispose()
write "True count:  $trueCount"
write "False count: $falseCount"

It yields the same results in about a minute, which meets my performance requirements.
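For reference, a run can be timed the same way as the original pipeline (the script name here is just a placeholder):

Measure-Command { .\Count-Records.ps1 -file .\items.txt } | select TotalMinutes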



Just adding another example that uses StreamReader to read through a very large IIS log file, output all unique client IP addresses, and show some perf metrics.

$path = 'A_245MB_IIS_Log_File.txt'
$r = [IO.File]::OpenText($path)

$clients = @{}   # hashtable keys act as a set of unique client IPs

while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()

    # String processing here...
    if (-not $line.StartsWith('#')) {
        $split = $line.Split()
        $client = $split[-5]   # client IP is the fifth field from the end of the line
        if (-not $clients.ContainsKey($client)){
            $clients.Add($client, $null)
        }
    }
}

$r.Dispose()
$clients.Keys | Sort

A little performance comparison against Get-Content:

StreamReader: completed in 5.5 seconds; PowerShell.exe: 35,328 KB RAM.

Get-Content: completed in 23.6 seconds; PowerShell.exe: 1,110,524 KB RAM.
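Numbers like these can be reproduced with Measure-Command plus the process working set (a rough sketch; run each variant in a fresh console so one doesn't skew the other's memory figure):

$path = 'A_245MB_IIS_Log_File.txt'

# Put the StreamReader loop (or the Get-Content pipeline) inside the script block
$elapsed = Measure-Command {
    $r = [IO.File]::OpenText($path)
    while ($r.Peek() -ge 0) { $null = $r.ReadLine() }
    $r.Dispose()
}

"Completed: {0:n1} seconds" -f $elapsed.TotalSeconds
"PowerShell.exe: {0:n0} KB RAM" -f ((Get-Process -Id $PID).WorkingSet64 / 1KB)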

