
I'm trying to format large text files (~300MB) that contain one to three pipe-delimited columns:

12345|123 Main St, New York|91110
23456|234 Main St, New York
34567|345 Main St, New York|91110

And the output should be:

000000000012345,"123 Main St, New York",91110,,,,,,,,,,,,
000000000023456,"234 Main St, New York",,,,,,,,,,,,,
000000000034567,"345 Main St, New York",91110,,,,,,,,,,,,

I'm new to PowerShell, but I've read that I should avoid Get-Content, so I am using a StreamReader. It is still much too slow:

function append-comma {} # helper function to append the correct amount of commas to each line


$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"

$target_file_in = New-Object System.IO.StreamReader -Arg $infile

If ($header -eq 'TRUE') {
    $firstline = $target_file_in.ReadLine() #skip header if exists
}

while (!$target_file_in.EndOfStream ) {

    $line = $target_file_in.ReadLine() 
    $a = $line.split($separator)[0].trim()
    $b = ""
    $c = ""
    if ($dataType -eq 'ECN'){$a = $a.padleft(15,'0')}
    if ($line.split($separator)[1].length -gt 0){$b = $line.split($separator)[1].trim()}
    if ($line.split($separator)[2].length -gt 0){$c = $line.split($separator)[2].trim()}
    $line = $a +',"'+$b+'","'+$c +'"'
    $line -replace '(?m)"([^,]*?)"(?=,|$)', '$1' |append-comma >> $outfile
}

$target_file_in.close()

I am building this for other people on my team and wanted to add a GUI using this guide: http://blogs.technet.com/b/heyscriptingguy/archive/2014/08/01/i-39-ve-got-a-powershell-secret-adding-a-gui-to-scripts.aspx

Is there a faster way to do this in PowerShell? I wrote a script using Linux bash (Cygwin64 on Windows) and a separate one in Python. Both ran much faster, but I am trying to script something that would be "approved" on a Windows platform.
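For reference (the Python script mentioned above isn't shown), here is a minimal sketch of the transformation logic in Python; the function name and the 15-field total are assumptions inferred from the sample output:

```python
def format_line(line):
    """Hypothetical sketch: split a pipe-delimited line, trim each field,
    zero-pad the id to 15 characters, quote the address (it contains commas),
    and pad the row out to 15 comma-separated fields."""
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    ident = fields[0].rjust(15, "0")
    addr = fields[1] if len(fields) > 1 else ""
    zipc = fields[2] if len(fields) > 2 else ""
    # 3 real fields plus 12 empty ones = 15 total
    return ",".join([ident, '"%s"' % addr, zipc] + [""] * 12)
```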

  • Are the numbers in the first field always going to have 5 digits? Also, is trimming required (i.e. is there a possibility of fields having leading/trailing whitespace)? Commented May 15, 2015 at 22:51
  • The first field could be anywhere from 1-15 characters, but should end up being 15 characters total after leftpad. Trim is necessary. Commented May 15, 2015 at 23:11

4 Answers


All that splitting and replacing costs you far more time than you gain from the StreamReader. The code below cut execution time to ~20% of the original for me:

$separator = '|'
$infile    = "\large_data.csv"
$outfile   = "new_file.csv"

if ($header -eq 'TRUE') {
  $linesToSkip = 1
} else {
  $linesToSkip = 0
}

Get-Content $infile | select -Skip $linesToSkip | % {
  [int]$a, [string]$b, [string]$c = $_.split($separator)
  '{0:d15},"{1}",{2},,,,,,,,,,,,,' -f $a, $b.Trim(), $c.Trim()
} | Set-Content $outfile

  • I would like to use this, but Get-Content reads the entire file into memory. I get out-of-memory errors when using it on large files.
  • @Liturgist The entire purpose of using a pipeline like this is to avoid reading the entire file into memory. Please post a new question with your actual code and evidence.

How does this work for you? I was able to read and process a 35MB file in about 40 seconds on a cheap ole workstation.

File Size: 36,548,820 bytes

Processed In: 39.7259722 seconds

Function CheckPath {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$True,
        ValueFromPipeline=$True)]
        [string[]]$Path
    )
    PROCESS {
        if ((Test-Path -LiteralPath $Path) -eq $False) { Write-Host "Invalid File Path $Path" }
    }
}

$startDTM = (Get-Date) #Get Script Start Time For Measurement

$infile = "infile.txt"
$outfile = "result5.txt"

#Check File Path
CheckPath $InFile

#Initiate StreamReader
$Reader = New-Object -TypeName System.IO.StreamReader($InFile);

#Create New File Stream Object For StreamWriter
$WriterStream = New-Object -TypeName System.IO.FileStream(
 $outfile,
 [System.IO.FileMode]::Create,
 [System.IO.FileAccess]::Write);

#Initiate StreamWriter
$Writer = New-Object -TypeName System.IO.StreamWriter(
 $WriterStream,
 [System.Text.Encoding]::ASCII);

If ($header -eq $True) {
    $Reader.ReadLine() |Out-Null #Skip First Line In File
}

while ($Reader.Peek() -ge 0) {
    $line = $Reader.ReadLine() #Read Line
    $Line = $Line.split('|') #Split Line
    $OutPut = "$($($line[0]).PadLeft(15,'0')),`"$($Line[1])`",$($Line[2]),,,,,,,,,,,,"
    $Writer.WriteLine($OutPut)
}

$Reader.Close();
$Reader.Dispose();
$Writer.Flush();

$Writer.Close();
$Writer.Dispose();

$endDTM = (Get-Date) #Get Script End Time For Measurement

Write-Host "Elapsed Time: $(($endDTM-$startDTM).totalseconds) seconds" #Echo Time elapsed

  • Your code is exactly what I need. The stream-reading/writing circumvents the memory overload. A 275MB file takes about 910 seconds. This should be added to the top of the code to retrieve the start time for the final "Elapsed Time" calculation: $startDTM = (Get-Date) #Get Script Start Time For Measurement

Regex is fast:

$infile = ".\large_data.csv"
gc $infile | % {
    $x = if ($_.IndexOf('|') -ne $_.LastIndexOf('|')) {
        $_ -replace '(.+)\|(.+)\|(.+)', ('$1,"$2",$3' + ',' * 12)
    } else {
        $_ -replace '(.+)\|(.+)', ('$1,"$2"' + ',' * 14)
    }
    ('0' * (15 - ($x -replace '([^,]),.+', '$1').Length)) + $x
}
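For comparison, the same branching-regex idea can be sketched in Python; the patterns and trailing-comma counts mirror the PowerShell above, and the zero-padding step is expressed with a simple split rather than a third regex:

```python
import re

def regex_format(line):
    """Apply a different replacement to two- vs three-field lines,
    mirroring the indexof/lastindexof branch above, then zero-pad
    the first field to 15 characters."""
    if line.count('|') == 2:
        x = re.sub(r'(.+)\|(.+)\|(.+)', r'\1,"\2",\3' + ',' * 12, line)
    else:
        x = re.sub(r'(.+)\|(.+)', r'\1,"\2"' + ',' * 14, line)
    first = x.split(',', 1)[0]              # the id field before the first comma
    return '0' * (15 - len(first)) + x
```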



I have another approach: let PowerShell read the input file as a CSV file with a pipe character as the delimiter, then format the output the way you want it. I have not tested this for speed with large files.

$infile = "\large-data.csv"
$outfile = "new-file.csv"

Import-Csv $infile -Header id,addr,zip -Delimiter "|" |
  % { '{0},"{1}",{2},,,,,,,,,,,,,' -f $_.id.PadLeft(15,'0'), $_.addr.Trim(), $_.zip } |
  Set-Content $outfile
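For comparison, the same header-based parsing can be sketched with Python's csv module; the id/addr/zip column names follow the -Header list above, and format_row is a hypothetical helper:

```python
import csv

def format_row(row):
    """row: a list of 1-3 fields from a pipe-delimited record,
    as produced by csv.reader(..., delimiter='|')."""
    row = list(row) + [""] * (3 - len(row))  # pad short rows to 3 fields
    ident, addr, zipc = (f.strip() for f in row[:3])
    return '{0},"{1}",{2}'.format(ident.rjust(15, "0"), addr, zipc) + ',' * 12

# csv.reader accepts any iterable of lines, so a file object streams lazily.
```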

  • This does not cover what the OP is really looking for, which is data manipulation. He has also shown good research into handling large files. If you want this to work, you need to generate the output the OP desires, and it would not be a bad idea to look into Measure-Command to compare results and see if it is faster or at least comparable. You could start with mockaroo.com to get your source data if you would like.
  • OK, I've rewritten this based on your comment. I'm not sure which goal is most important: speed or acceptance on the Windows platform.
  • You are right, the OP is really asking 2 questions, but now you cover at least both of them. This is better.
