I'm trying to figure out why there's a huge difference in the output sizes when encoding a file in base64 in Powershell vs GNU coreutils. Depending on options (UTF8 vs Unicode), the Powershell output ranges from about 240MB to 318MB. Using coreutils base64 (in Cygwin, in this case), the output is about 80MB. The original filesize is about 58MB. So, 2 questions:

  1. Why is there such a drastic difference?
  2. How can I get Powershell to give the smaller output that the GNU tool gives?

Here are the specific commands I used:

Powershell smaller output:

$input = "C:\Users\my.user\myfile.pdf"
$filecontent = get-content $input
$converted = [System.Text.Encoding]::UTF8.GetBytes($filecontent)
$encodedtext = [System.Convert]::ToBase64String($converted)
$encodedtext | Out-File "C:\Users\my.user\myfile.pdf.via_ps.base64"

The larger Powershell output came from simply replacing "UTF8" with "Unicode". It will be obvious that I'm pretty new to Powershell; I'm sure someone only slightly better with it could combine that into a couple of simple lines.

Coreutils (via Cygwin) base64:

base64.exe -w0 myfile.pdf > myfile.pdf.via_cygwin.base64

1 Answer

Why is there such a drastic difference?

Because you're doing something wildly different in PowerShell.

How can I get Powershell to give the smaller output that the GNU tool gives?

By doing what base64 does :)


Let's have a look at what base64 ... > ... actually does:

  • base64:
    • Opens file handle to input file
    • Reads raw byte stream from disk
    • Converts every 3-byte group to a 4-byte base64-encoded output string-fragment
  • >:
    • Writes raw byte stream to disk

Since the 4-byte output fragments only contain byte values that correspond to 64 printable ASCII characters, the command never actually does any "string manipulation". The values on which it operates just happen to also be printable as ASCII strings, and the resulting file is therefore indistinguishable from a "text file".
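To make the 3-bytes-in, 4-bytes-out ratio concrete, here's a rough sketch. The sample bytes and the Get-Base64Length helper are made up for illustration (the helper is not a built-in cmdlet), and the 58MB figure is just the nominal size from the question:

```powershell
# Hypothetical sample: 6 raw bytes, i.e. two 3-byte groups
[byte[]]$bytes = 0..5
$b64 = [System.Convert]::ToBase64String($bytes)
$b64          # AAECAwQF
$b64.Length   # 8, i.e. two 4-character fragments

# Hypothetical helper: base64 length (with padding, no line breaks) for N input bytes
function Get-Base64Length {
    param([long]$ByteCount)
    # Every started 3-byte group costs 4 output characters
    return 4 * [long][math]::Ceiling($ByteCount / 3.0)
}

Get-Base64Length 6            # 8
Get-Base64Length 7            # 12 (the last group is padded with '=')
Get-Base64Length (58 * 1MB)   # 81089880, roughly 77 MiB
```

So a file of about 58MB should come out at a bit under 80MB, which is in line with what base64 -w0 gave you.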

Your PowerShell script on the other hand does lots of string manipulation:

  • Get-Content $input:
    • Opens file handle to input file
    • Reads raw byte stream from disk
    • Decodes the byte stream according to some chosen encoding scheme (in Windows PowerShell, likely your system's ANSI code page) and splits the resulting text into lines
  • [Encoding]::UTF8.GetBytes():
    • Re-encodes the resulting string using UTF8
  • [Convert]::ToBase64String()
    • Converts every 3-byte group to a 4-byte base64-encoded output string-fragment
  • Out-File:
    • Encodes input string as little-endian UTF16 (the default in Windows PowerShell)
    • Writes to disk

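The three additional string encoding steps (the decoding in Get-Content, the re-encoding in GetBytes(), and the UTF16 encoding in Out-File) will result in a much-inflated byte stream, which is why your output is several times larger than what base64 produces. On top of that, the decode/re-encode round trip alters the bytes, so the result is not even a valid encoding of the original file.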
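You can watch the inflation happen on a small buffer. This is only a sketch under assumptions: the 256 sample bytes are made up, and ISO-8859-1 (code page 28591) stands in for whatever code page Get-Content actually picks on your machine:

```powershell
# Hypothetical sample: every possible byte value once
[byte[]]$raw = 0..255

# What base64 does: encode the raw bytes directly
$direct = [System.Convert]::ToBase64String($raw).Length            # 344

# What the script does, step by step
$latin1    = [System.Text.Encoding]::GetEncoding(28591)            # stand-in for the code page (assumption)
$text      = $latin1.GetString($raw)                               # bytes -> string
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($text)          # string -> bytes (384 of them)
$b64       = [System.Convert]::ToBase64String($utf8Bytes)          # 512 characters
$onDisk    = [System.Text.Encoding]::Unicode.GetByteCount($b64)    # 1024 bytes as UTF16, before any BOM

# Same again with Unicode instead of UTF8 in the middle step
$utf16Bytes   = [System.Text.Encoding]::Unicode.GetBytes($text)    # 512 bytes
$b64Wide      = [System.Convert]::ToBase64String($utf16Bytes)      # 684 characters
$onDiskWide   = [System.Text.Encoding]::Unicode.GetByteCount($b64Wide)  # 1368 bytes

"{0} vs {1} vs {2}" -f $direct, $onDisk, $onDiskWide               # 344 vs 1024 vs 1368
```

That's roughly 3x and 4x the direct encoding, which is about the same ratio as your 240MB and 318MB against the 80MB from base64. The exact numbers for a real PDF depend on its byte distribution and on the code page in play.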


How to base64-encode files then?

The trick here is to read the raw bytes from disk and pass those directly to [convert]::ToBase64String().

It is technically possible to just read the entire file into an array at once:

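$bytes = Get-Content path\to\file.ext -Encoding Byte # Windows PowerShell only
# or, in PowerShell 6 and later
$bytes = Get-Content path\to\file.ext -AsByteStream -Raw
# or, in either edition
$bytes = [System.IO.File]::ReadAllBytes($(Convert-Path path\to\file.ext))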

$b64String = [convert]::ToBase64String($bytes)

Set-Content path\to\output.base64 -Value $b64String -Encoding Ascii

... I'd strongly recommend against doing so for files larger than a few kilobytes.

Instead, for file transformation in general, you'll want to use streams. In this particular case, you'll want to use a CryptoStream with a ToBase64Transform to re-encode a file stream as base64:

function New-Base64File {
    [CmdletBinding(DefaultParameterSetName = 'ByPath')]
    param(
        [Parameter(Mandatory = $true, ParameterSetName = 'ByPath', Position = 0)]
        [string]$Path,

        [Parameter(Mandatory = $true, ParameterSetName = 'ByPSPath')]
        [Alias('PSPath')]
        [string]$LiteralPath,

        [Parameter(Mandatory = $true, Position = 1)]
        [string]$Destination
    )

    # Create destination file if it doesn't exist
    if (-not(Test-Path -LiteralPath $Destination -PathType Leaf)) {
        $outFile = New-Item -Path $Destination -ItemType File
    }
    else {
        $outFile = Get-Item -LiteralPath $Destination
    }

    [void]$PSBoundParameters.Remove('Destination')

    try {
        # Open a writable file stream to the output file
        $outStream = $outFile.OpenWrite()

        # OpenWrite() does not truncate, so discard any previous contents
        $outStream.SetLength(0)

        # Wrap output file stream in a CryptoStream.
        #
        # Anything that we write to the crypto stream is automatically 
        # base64-encoded and then written through to the output file stream 
        $transform = [System.Security.Cryptography.ToBase64Transform]::new()
        $cryptoStream = [System.Security.Cryptography.CryptoStream]::new($outStream, $transform, 'Write')

        foreach ($file in Get-Item @PSBoundParameters) {
            try {
                # Open readable input file stream
                $inStream = $file.OpenRead()

                # Copy input bytes to crypto stream
                # - which in turn base64-encodes and writes to output file
                $inStream.CopyTo($cryptoStream)
            }
            finally {
                # Clean up the input file stream
                $inStream | ForEach-Object Dispose
            }
        }
    }
    finally {
        # Clean up the output streams; dispose the crypto stream first
        # so that it flushes its final block through the transform
        $cryptoStream, $transform, $outStream | ForEach-Object Dispose
    }
}

Now you can do:

$inputPath = "C:\Users\my.user\myfile.pdf"

New-Base64File $inputPath -Destination "C:\Users\my.user\myfile.pdf.via_ps.base64"

And expect an output the same size as with base64.
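If you want to convince yourself that streaming through a CryptoStream yields the same text as [convert]::ToBase64String(), here's a small in-memory sketch. The sample bytes are made up, and a MemoryStream stands in for the output file:

```powershell
# Hypothetical sample data
[byte[]]$sample = 0..255

$outStream    = [System.IO.MemoryStream]::new()
$transform    = [System.Security.Cryptography.ToBase64Transform]::new()
$cryptoStream = [System.Security.Cryptography.CryptoStream]::new(
    $outStream, $transform, [System.Security.Cryptography.CryptoStreamMode]::Write)

try {
    # Everything written here is base64-encoded and passed on to $outStream
    $cryptoStream.Write($sample, 0, $sample.Length)

    # Emit the last (possibly padded) 4-byte fragment
    $cryptoStream.FlushFinalBlock()

    # The encoded output consists of ASCII bytes only
    $streamed = [System.Text.Encoding]::ASCII.GetString($outStream.ToArray())
}
finally {
    $cryptoStream.Dispose()
    $transform.Dispose()
    $outStream.Dispose()
}

$direct = [System.Convert]::ToBase64String($sample)

# -ceq, because base64 is case-sensitive and -eq is not
$streamed -ceq $direct   # True
```

The difference from the read-everything approach is that the stream version only ever holds a small buffer in memory, no matter how large the input is.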
