
I need to parse a large pipe-delimited file and count how many records have a 5th column that meets my criteria and how many don't.

PS C:\temp> gc .\items.txt -readcount 1000 | `
  ? { $_ -notlike "HEAD" } | `
  % { foreach ($s in $_) { $s.split("|")[4] } } | `
  group -property {$_ -ge 256} -noelement | `
  ft -autosize

This command does what I want, returning output like this:

  Count Name
  ----- ----
1129339 True
2013703 False

However, for a 500 MB test file, this command takes about 5.5 minutes to run as measured by Measure-Command. A typical file is over 2 GB, so waiting 20+ minutes is undesirably long.

Do you see a way to improve the performance of this command?

For example, is there a way to determine an optimum value for Get-Content's ReadCount? Without ReadCount, the same file takes 8.8 minutes.
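For what it's worth, one rough way I can think of to probe for a good ReadCount is to time a few candidate values with Measure-Command (the values below are just guesses):

foreach ($rc in 100, 500, 1000, 5000, 10000) {
    $t = Measure-Command {
        gc .\items.txt -readcount $rc |
          ? { $_ -notlike "HEAD" } |
          % { foreach ($s in $_) { $s.split("|")[4] } } |
          Out-Null
    }
    "ReadCount {0,6}: {1:n1} s" -f $rc, $t.TotalSeconds
}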

  • Have you tried StreamReader? I think that Get-Content loads the whole file into memory before it does anything with it. Commented Jan 17, 2012 at 21:52
  • You mean by importing System.IO? Commented Jan 17, 2012 at 21:59
  • Yeah, use the .NET Framework if you can. I used it to read large log files that SQL Server generates, with good results. I don't know any other way in PowerShell to read large files efficiently, but I'm no expert. Commented Jan 17, 2012 at 22:08
  • @Gisli, if you write your comment as an answer, I can upvote it and eventually accept it. Using StreamReader allowed me to get the time down to 1 minute for the test file. Commented Jan 17, 2012 at 22:40

3 Answers


Have you tried StreamReader? I think that Get-Content loads the whole file into memory before it does anything with it.

StreamReader class
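A minimal sketch of the pattern (the path is a placeholder; the try/finally just makes sure the reader gets disposed even if something throws):

$reader = New-Object System.IO.StreamReader('C:\temp\items.txt')
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        # process one line at a time; the whole file is never held in memory
    }
}
finally {
    $reader.Dispose()
}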




Using @Gisli's hint, here's the script I ended up with:

param($file = $(Read-Host -prompt "File"))
$fullName = (Get-Item "$file").FullName

# Stream the file one line at a time instead of piping it through Get-Content
$sr = New-Object System.IO.StreamReader("$fullName")
$trueCount = 0
$falseCount = 0
while (($line = $sr.ReadLine()) -ne $null) {
    if ($line -like 'HEAD|') { continue }    # skip the 'HEAD|' marker line
    if ($line.Split("|")[4] -ge 256) {       # 5th pipe-delimited column
        $trueCount++
    }
    else {
        $falseCount++
    }
}
$sr.Dispose()
write "True count:  $trueCount"
write "False count: $falseCount"

It yields the same results in about a minute, which meets my performance requirements.
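For reference, a run can be timed the same way as the original pipeline (the script name here is just a placeholder):

Measure-Command { .\Count-Records.ps1 -file .\items.txt } | select TotalMinutes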



Just adding another example that uses StreamReader to read through a very large IIS log file, output all unique client IP addresses, and show some perf metrics.

$path = 'A_245MB_IIS_Log_File.txt'
$r = [IO.File]::OpenText($path)

$clients = @{}   # hashtable keys act as a set of unique client IPs

while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()

    # String processing here...
    if (-not $line.StartsWith('#')) {
        $split = $line.Split()
        $client = $split[-5]   # client IP is the fifth field from the end of the line
        if (-not $clients.ContainsKey($client)){
            $clients.Add($client, $null)
        }
    }
}

$r.Dispose()
$clients.Keys | Sort

A little performance comparison against Get-Content:

StreamReader: completed in 5.5 seconds; PowerShell.exe: 35,328 KB RAM.

Get-Content: completed in 23.6 seconds; PowerShell.exe: 1,110,524 KB RAM.
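Numbers like these can be reproduced with Measure-Command plus the process working set (a rough sketch; run each variant in a fresh console so one doesn't skew the other's memory figure):

$path = 'A_245MB_IIS_Log_File.txt'

# Put the StreamReader loop (or the Get-Content pipeline) inside the script block
$elapsed = Measure-Command {
    $r = [IO.File]::OpenText($path)
    while ($r.Peek() -ge 0) { $null = $r.ReadLine() }
    $r.Dispose()
}

"Completed: {0:n1} seconds" -f $elapsed.TotalSeconds
"PowerShell.exe: {0:n0} KB RAM" -f ((Get-Process -Id $PID).WorkingSet64 / 1KB)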

