I am getting grey hair over this. I need to convert strings in PowerShell to UTF-8. My reference code is in Java (and works as intended with the bigger application), so I need to reproduce what it does.

In Java, I do:

    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }
    
    public static void main(String[] args) throws Exception {
        System.out.println(bytesToHex("aöß".getBytes("UTF8")));
    }

which outputs 61C3B6C39F.

In PowerShell, I do:

    Write-Output $(([System.Text.UTF8Encoding]::New($false, $true).getBytes("aöß") | ForEach-Object ToString X2) -join '')

which outputs 61C383C2B6C383C5B8.

Why are they different? How can I make the PowerShell encoding match the Java one?

I would be very grateful for any insights!

Best eDude

EDIT: Ok, now I am more confused. When running the above command in the PowerShell 5.1 console, it works as expected. When putting it into a script file and executing that, it does not.

EDIT 2: More info: if the script file is saved in UTF-8 encoding, the error appears. If it is saved in another encoding (e.g. Notepad++'s ANSI), it works. Why does the encoding of the script file change the behavior of the script itself? How can I prevent this and make sure I get consistent results?

2 Comments

  • Looks like the PowerShell output is encoded twice. Commented Jan 5, 2022 at 11:36
  • I'm unable to reproduce this; I get 61c3b6c39f in PowerShell 5.1, 7.1, and 7.2. Which version are you using? Commented Jan 5, 2022 at 12:00

1 Answer

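Try converting your script file to UTF-8-BOM encoding in Notepad++ and running it again. Windows PowerShell 5.1 reads a script file without a BOM using the system's ANSI code page, here Western European (Windows) (windows-1252), not UTF-8. Each byte of the two-byte UTF-8 sequences for ö (C3 B6) and ß (C3 9F) is therefore treated as a separate character (Ã¶ and ÃŸ), and encoding those four characters as UTF-8 yields C383 C2B6 C383 C5B8. That is why the non-ASCII part of your output is twice as long.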
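The mismatch can be reproduced outside PowerShell. The following is a minimal Java sketch, not the asker's application code: it takes the UTF-8 bytes of the string, decodes them as windows-1252 (what a reader that ignores the missing BOM would do), and encodes the result as UTF-8 again. The class and method names are hypothetical, and it assumes the running JDK provides the `windows-1252` charset, which standard JDK builds normally do. The string literal uses `\u` escapes so that the demo does not depend on its own source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class DoubleEncodingDemo {

    // Hex-encode bytes as uppercase two-digit pairs.
    static String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulate a reader that decodes UTF-8 bytes as windows-1252.
    // Assumption: the JDK in use ships the windows-1252 charset.
    static String misreadAsWindows1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "aöß" written with escapes, independent of the source file encoding.
        String original = "a\u00F6\u00DF";

        // Correct encoding: prints 61C3B6C39F
        System.out.println(bytesToHex(original.getBytes(StandardCharsets.UTF_8)));

        // Misread, then encoded again: prints 61C383C2B6C383C5B8
        String misread = misreadAsWindows1252(original);
        System.out.println(bytesToHex(misread.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The second line of output matches the value reported for the script file saved as UTF-8 without a BOM, which supports the double-encoding explanation.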

The default encoding in PowerShell 7 is UTF-8, so the problem shouldn't occur there.
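For Windows PowerShell 5.1, where the BOM matters, it can help to check whether a script file already starts with the UTF-8 BOM (the bytes EF BB BF). Here is a rough Java sketch, in the same language as the reference code in the question. `Utf8Bom`, `hasBom`, and `withBom` are hypothetical names, and the script path is whatever you pass as the first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper; names are illustrative only.
final class Utf8Bom {

    // The UTF-8 byte order mark.
    static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the content starts with EF BB BF.
    static boolean hasBom(byte[] content) {
        return content.length >= BOM.length
                && content[0] == BOM[0]
                && content[1] == BOM[1]
                && content[2] == BOM[2];
    }

    // Return the content with a BOM prepended, unless one is already present.
    static byte[] withBom(byte[] content) {
        if (hasBom(content)) {
            return content;
        }
        byte[] out = new byte[BOM.length + content.length];
        System.arraycopy(BOM, 0, out, 0, BOM.length);
        System.arraycopy(content, 0, out, BOM.length, content.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Utf8Bom <script-file>");
            return;
        }
        // Only reports; it does not modify the file.
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasBom(content) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```

The sketch only inspects the first three bytes. It does not verify that the rest of the file is valid UTF-8, so a file saved in an ANSI code page would simply be reported as having no BOM.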

You can check the default encoding for the different PowerShell versions like this:

    PS> [System.Text.Encoding]::Default

You can also build the string from character codes, so that the script contains only ASCII and behaves the same in files without a BOM:

    $str = [char]0x0061 + [char]0x00F6 + [char]0x00DF

    Write-Output $(([System.Text.Encoding]::UTF8.GetBytes($str) | ForEach-Object ToString X2) -join '')

1 Comment

Thank you, that works! Still weird that the encoding of the script file itself modifies its behavior...
