Java and .NET/PowerShell producing different UTF-8 bytes

Question

I am getting grey hair over this. I need to convert strings in PowerShell to UTF-8. My reference code is in Java (and works as intended with the bigger application), so I need to reproduce what it does.

In Java, I do:

    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }
    
    public static void main(String[] args) throws Exception {
        System.out.println(bytesToHex("aöß".getBytes("UTF8")));
    }

which outputs 61C3B6C39F.

In PowerShell, I do:

    Write-Output $(([System.Text.UTF8Encoding]::New($false, $true).getBytes("aöß") | ForEach-Object ToString X2) -join '')

which outputs 61C383C2B6C383C5B8.

Why are they different? How can I make the PowerShell encoding match the Java one?

I would be very grateful for any insights!

Best eDude

EDIT: Ok, now I am more confused. When running the above command in the PowerShell 5.1 console, it works as expected. When putting it into a script file and executing that, it does not.

EDIT 2: More info, if the script file is saved in UTF-8 encoding, the error appears. If it is saved in another encoding (e.g. Notepad++'s ANSI), it works. Why is the encoding of the script file changing the behavior of the script itself? How can I prevent this and make sure to get consistent results?

I'm unable to reproduce this; I get 61c3b6c39f in PowerShell 5.1, 7.1 and 7.2. Which version are you using?
– Mathias R. Jessen, Commented Jan 5, 2022 at 12:00

antonyoni · Accepted Answer · 2022-01-05 21:31:27Z

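Try converting your script file to UTF-8 with a BOM ("UTF-8-BOM" in Notepad++) and running it. Windows PowerShell 5.1's default encoding is the system's ANSI code page, here Western European (Windows) (windows-1252), so when there's no BOM in your script file it reads the file as windows-1252 rather than UTF-8. Each byte of a multi-byte UTF-8 sequence is then decoded as a separate character (ö becomes Ã¶, ß becomes ÃŸ), so the string literal is longer than intended and encodes to more bytes.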
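
To illustrate where the extra bytes in the question come from, here is a minimal Java sketch that simulates the misreading: it takes the UTF-8 bytes of "aöß", decodes them as windows-1252, and encodes the result as UTF-8 again. The class name `MojibakeDemo` and the helper names `hex` and `misreadAsCp1252` are made up for this illustration; it only models what a windows-1252 reader would see, not PowerShell's actual file-loading code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hex-encode bytes as uppercase pairs, like bytesToHex in the question.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulate a BOM-less UTF-8 file being read as windows-1252:
    // every byte of the UTF-8 encoding becomes one character.
    public static String misreadAsCp1252(String s) {
        byte[] onDisk = s.getBytes(StandardCharsets.UTF_8);
        return new String(onDisk, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "a\u00F6\u00DF"; // "aöß"

        // Correct UTF-8 bytes: 61C3B6C39F
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));

        // The misread string has 5 characters instead of 3.
        String misread = misreadAsCp1252(original);
        System.out.println(misread.length());

        // Encoding the misread string gives the bytes from the question:
        // 61C383C2B6C383C5B8
        System.out.println(hex(misread.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The second hex string matches the output the script file produced, which is consistent with the file having been decoded as windows-1252.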

The default encoding in PowerShell 7 is UTF-8, so it shouldn't be a problem there.

You can check the default encoding for the different PowerShell versions like this:

    PS> [System.Text.Encoding]::Default

You can also build the string from character codes instead of typing the non-ASCII characters literally, which avoids this issue in files without a BOM:

    $str = [char]0x0061 + [char]0x00F6 + [char]0x00DF

    Write-Output $(([System.Text.Encoding]::UTF8.GetBytes($str) | ForEach-Object ToString X2) -join '')

explorerDude Over a year ago

Thank you, that works! Still weird that the encoding of the script file itself modifies its behavior...
