
I have this code that works like a charm for small files. It reads the whole file into memory, replaces NUL characters, and writes the result back to the same file. That's not practical for huge files whose size exceeds the available memory. Can someone help me convert it to a streaming model so it won't choke on huge files?

Get-ChildItem -Path "Drive:\my\folder\path" -Depth 2 -Filter *.csv | 
ForEach-Object {
    $content = Get-Content $_.FullName
    # Replace NUL and save content back to the original file
    $content -replace "`0","" | Set-Content $_.FullName
}
  • What research efforts have you undertaken thus far? Commented Apr 20, 2021 at 17:27
  • What's up with the replace pattern? You are escaping the 0; if your intent is to replace a literal zero, it might not work. Let me know and I'll update my answer accordingly. Commented Apr 20, 2021 at 18:03
  • @Steven backtick-0 (`0) in PowerShell is the ASCII 0 / NUL character in the CSV file that I am trying to replace with an empty string. Commented Apr 20, 2021 at 22:06
  • I just thought of this, but it might be a better practice to try and match \0. I know in other RegEx flavors that'll match a NUL. Typical advice in PowerShell is to use the RegEx metacharacters for operators like -replace & -split. Commented May 2, 2021 at 16:42

2 Answers


The way you have this structured, the entire file contents have to be read into memory. Note: reading a file into memory can use 3-4x the file size in RAM, as documented here.

Without getting into .NET classes, particularly [System.IO.StreamReader], Get-Content is actually very memory efficient; you just have to leverage the pipeline so you don't build up the data in memory.

Note: if you do decide to try StreamReader, the article will give you some syntax clues. Moreover, that topic has been covered by many others on the web.

Get-ChildItem -Path "C:\temp" -Depth 2 -Filter *.csv | 
ForEach-Object{
    $CurrentFile = $_
    $TmpFilePath = Join-Path $CurrentFile.Directory.FullName ($CurrentFile.BaseName + "_New" + $CurrentFile.Extension)
    
    Get-Content $CurrentFile.FullName |
    ForEach-Object{ $_ -replace "`0","" } |
    Add-Content $TmpFilePath 

    # Now that you've got the new file you can rename it & delete the original:
    Remove-Item -Path $CurrentFile.FullName
    Rename-Item -Path $TmpFilePath -NewName $CurrentFile.Name
} 

This is a streaming model: Get-Content streams inside the outer ForEach-Object loop. There may be other ways to do it, but I chose this approach so I could keep track of the names and do the file swap at the end.

Note: Per the same article, Get-Content is quite slow in terms of speed. However, your original code was likely already suffering that burden. You can speed it up a bit using the -ReadCount XXXX parameter, which sends some number of lines down the pipe at a time. That of course uses more memory, so you'd have to find a level that keeps you within the boundaries of your available RAM. The performance improvement from -ReadCount is mentioned in this answer's comments.
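For illustration, the first example above could use -ReadCount like this. The batch size of 1000 is an arbitrary choice; tune it against your available RAM:

```powershell
Get-ChildItem -Path "C:\temp" -Depth 2 -Filter *.csv | 
ForEach-Object{
    $CurrentFile = $_
    $TmpFilePath = Join-Path $CurrentFile.Directory.FullName ($CurrentFile.BaseName + "_New" + $CurrentFile.Extension)

    # -ReadCount 1000 emits arrays of up to 1000 lines at a time;
    # -replace operates element-wise on each array, so no inner loop is needed.
    Get-Content $CurrentFile.FullName -ReadCount 1000 |
    ForEach-Object{ $_ -replace "`0","" } |
    Add-Content $TmpFilePath

    Remove-Item -Path $CurrentFile.FullName
    Rename-Item -Path $TmpFilePath -NewName $CurrentFile.Name
}
```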

Update Based on Comments:

Here's an example of using StreamReader/Writer to perform the same operations from the previous example. This should be just as memory efficient as Get-Content, but should be much faster.

Get-ChildItem -Path "C:\temp" -Depth 2 -Filter *.csv | 
ForEach-Object{
    $CurrentFile = $_.FullName
    $CurrentName = $_.Name
    $TmpFilePath = Join-Path $_.Directory.FullName ($_.BaseName + "_New" + $_.Extension)
    
    $StreamReader = [System.IO.StreamReader]::new( $CurrentFile )
    $StreamWriter = [System.IO.StreamWriter]::new( $TmpFilePath )

    While( !$StreamReader.EndOfStream )
    {
        $StreamWriter.WriteLine( ($StreamReader.ReadLine() -replace "`0","") )
    }
    
    $StreamReader.Close()
    $StreamWriter.Close()

    # Now that you've got the new file you can rename it & delete the original:
    Remove-Item -Path $CurrentFile
    Rename-Item -Path $TmpFilePath -NewName $CurrentName
} 

Note: I have some sense this issue is rooted in encoding; stray NULs in a text file are often a sign that a UTF-16 file was read as a single-byte encoding. The stream constructors do accept an encoding as an argument.

Available Encodings:

[System.Text.Encoding]::BigEndianUnicode
[System.Text.Encoding]::Default
[System.Text.Encoding]::Unicode
[System.Text.Encoding]::UTF32
[System.Text.Encoding]::UTF7
[System.Text.Encoding]::UTF8

So if you wanted to instantiate the streams with, for example, UTF8:

    $StreamReader = [System.IO.StreamReader]::new( $CurrentFile, [System.Text.Encoding]::UTF8 )
    $StreamWriter = [System.IO.StreamWriter]::new( $TmpFilePath, [System.Text.Encoding]::UTF8 )

The streams do default to UTF8. I think the system default is typically an ANSI code page like Windows-1252.


2 Comments

Thank you for your reply. I will explore the StreamReader since that appears to be the right way forward for very large files >= 50GB
No pressure, and I only point this out because you're a new contributor, but if this answer helped get you to a solution and if you're comfortable with it, consider marking it answered using the checkmark to the left.

This would be the simplest way, using the least memory: one line at a time, streamed to another file. But it needs double the disk space.

Get-Content file.txt | ForEach-Object { $_ -replace "`0" } | Set-Content file2.txt 
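If encoding matters (and with NULs in a CSV it often does), note that Get-Content and Set-Content both take an -Encoding parameter. A sketch, assuming the files are UTF8; adjust the encoding to match your actual data:

```powershell
Get-Content file.txt -Encoding UTF8 |
ForEach-Object { $_ -replace "`0" } |
Set-Content file2.txt -Encoding UTF8
```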

1 Comment

Sorry to be critical, but I think this is included in [my answer](https://stackoverflow.com/a/67184066/4749264). I covered additional steps like renaming the files, and also an alternate approach using Streams.
