How to fix unexpected output of Encoding.ASCII.GetBytes

Question

I am seeing an unexpected character (?) in the output of the Encoding.ASCII.GetBytes method.

So I am doing the following:

var stringBytes = Encoding.ASCII.GetBytes(myString);

Where myString is:

{
  "$id": "1",
  "Note": "<p><span style=\"font-family: &quot;Courier New&quot;;\">aaaa</span> 
  <br></p>"
}

Right after that, if I do:

var myString1 = System.Text.Encoding.Default.GetString(stringBytes);

Then myString1 is returned as:

{
  "$id": "1",
  "Note": "<p><span style=\"font-family: &quot;Courier New&quot;;\">? 
   aaaa</span><br></p>"
}

Note how the aaaa is transformed to ?aaaa in the last operation?

Can someone please tell me what I am missing here? Thank you.

Why are you using Encoding.Default to decode a string encoded with Encoding.ASCII? Even if your system did default to Encoding.ASCII for Encoding.Default, it seems like a bad idea in general. *On .NET Core, Encoding.Default is always Encoding.UTF8.
– ProgrammingLlama, Commented Apr 12, 2019 at 1:31
Thanks @John, yes you are right. I missed that. I will fix it, but that didn't fix the above problem. I believe Alexei's solution is a possible fix. Cheers.
– Stackedup, Commented Apr 12, 2019 at 1:38

Alexei Levenkov · Accepted Answer · 2019-04-12 04:34:28Z

5

This is the expected behavior of ASCII encoding when it finds a character outside the 0-127 range, as in your case. To fix it, either switch to UTF-8 (which supports all characters) or manually encode every character outside 0-127 into something that works for you (for JSON you can use a hex escape with a "\u" prefix, e.g. "\ufeff").
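Both fixes can be sketched as follows. This is a minimal illustration, not the asker's actual code: the sample string, the class name AsciiRoundTripDemo and the helper EscapeNonAscii are hypothetical, and the sample assumes an invisible U+FEFF sits in front of "aaaa".

```csharp
using System;
using System.Text;

public static class AsciiRoundTripDemo
{
    // Hypothetical helper: replaces every UTF-16 code unit above 127
    // with a JSON-style \uXXXX escape so the result is pure ASCII.
    public static string EscapeNonAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c > 127)
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Assumed stand-in for the question's data: U+FEFF precedes "aaaa".
        string original = ">\uFEFFaaaa";

        // ASCII cannot represent U+FEFF, so the encoder substitutes '?'.
        string viaAscii = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(original));
        Console.WriteLine(viaAscii);                 // >?aaaa

        // UTF-8 can represent every character, so the round trip is lossless.
        string viaUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(original));
        Console.WriteLine(viaUtf8 == original);      // True

        // Escaping first keeps the bytes ASCII-only without losing information.
        Console.WriteLine(EscapeNonAscii(original)); // >\ufeffaaaa
    }
}
```

Whichever option is used, the same encoding should be used for both GetBytes and GetString.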

The string "aaaa" for some reason starts with a BOM (U+FEFF), which you can't see, but it is there and gets converted to "?" by the ASCII encoding. To see the character code, select a piece of the string and print it as hex:

  ((int)(">\uFEFFaaaa"[1])).ToString("x")  // gives "feff"; the BOM is written here as \uFEFF, making the string 6 characters long
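When the position of the hidden character is not known in advance, the same check can be applied to every character. The sketch below is only an illustration; the class NonAsciiFinder, the method Describe and the sample string are assumed names, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class NonAsciiFinder
{
    // Hypothetical helper: reports the index and hex value of every
    // UTF-16 code unit above 127, i.e. everything ASCII would turn into '?'.
    public static List<string> Describe(string input)
    {
        var hits = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] > 127)
                hits.Add($"index {i}: U+{(int)input[i]:X4}");
        }
        return hits;
    }

    public static void Main()
    {
        // Assumed sample: an invisible U+FEFF between '>' and "aaaa".
        foreach (string hit in Describe(">\uFEFFaaaa"))
            Console.WriteLine(hit); // index 1: U+FEFF
    }
}
```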

Note that a BOM (byte order mark) in the middle of text is usually a bug; in this case, the code that constructs the HTML is likely concatenating files or doing something similar. Guidance from Unicode.org: "What should I do with U+FEFF in the middle of a file?"
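If the source of the stray character cannot be fixed, one possible workaround is to drop U+FEFF before the text is stored or encoded. The sketch below assumes that removing it is acceptable for your data; BomStripper and RemoveAll are hypothetical names.

```csharp
using System;

public static class BomStripper
{
    // Hypothetical helper: removes every U+FEFF from the input.
    // String.Replace(string, string) performs an ordinal search,
    // so the invisible character is matched exactly.
    public static string RemoveAll(string input) =>
        input.Replace("\uFEFF", string.Empty);

    public static void Main()
    {
        // Assumed sample: an invisible U+FEFF between '>' and "aaaa".
        string cleaned = RemoveAll(">\uFEFFaaaa");
        Console.WriteLine(cleaned);        // >aaaa
        Console.WriteLine(cleaned.Length); // 5
    }
}
```

This only hides the symptom; finding the code that introduces the character is still the better fix.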

Thanks to Klaus Gütter for the link to the BOM FAQ and Tom Blodget for highlighting issues with a BOM in the middle of text.

edited Apr 12, 2019 at 4:34

answered Apr 12, 2019 at 1:29


5 Comments

Stackedup Over a year ago

Alexei, thank you very much. Yes, now it makes sense. I used UTF8 to both encode and decode and it is working as I expect. Cheers.

Tom Blodget Over a year ago

JSON is required to be encoded with UTF-8 for inter-system communication.

Tom Blodget Over a year ago

@Stackedup A BOM should not be allowed to make it into a text datatype. A BOM is metadata not text.

Stackedup Over a year ago

@TomBlodget thank you for highlighting that. That is going to be difficult to figure out. I am using the Summernote editor, so I am reading its HTML content (.summernote('code')) and passing it to the server. The bug could be in Summernote.

Klaus Gütter Over a year ago

@TomBlodget Relevant section in BOM FAQ: What should I do with U+FEFF in the middle of a file?
