
The Java code below

    JSONObject obj = new JSONObject();
    try {
        obj.put("alert", "•é");
        byte[] test = obj.toString().getBytes("UTF-8");
        // Arrays.toString prints the byte values; "" + test would only print the array reference
        logger.info("bytes are " + Arrays.toString(test));
    } catch (JSONException e) {
        e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

produces a JSONObject which escapes the bullet character but not the Latin small letter e with acute, e.g. "\u2022é". The resulting bytes are [123, 34, 97, 108, 101, 114, 116, 34, 58, 34, 92, 117, 50, 48, 50, 50, -61, -87, 34, 125].

How can I get the exact same output in JavaScript (in terms of byte sequence)? I don't understand why JSONObject escapes one character but not the other; I don't know what rule it follows.

It seems that in JavaScript I can only either escape everything outside ASCII (e.g. the range \u007f-\uffff) or escape nothing at all.
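That "escape everything outside ASCII" approach can be sketched as follows (the regex and post-processing here are illustrative, not a standard API; note it over-escapes the é relative to the Java output):

```javascript
// Sketch: escape every character above ASCII after stringifying.
// This over-escapes "é" (\u00e9) compared to JSONObject's output.
const escaped = JSON.stringify({ alert: '•é' }).replace(
  /[\u007f-\uffff]/g,
  ch => '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0')
);
// escaped is '{"alert":"\u2022\u00e9"}' (with literal backslash-u sequences)
```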

Thanks!

  • What is the purpose of creating a byte[] anyway? That's a different issue than the escaping shown. Commented Jun 11, 2014 at 0:23
  • Because the length of the byte array is used later in the backend; that is why the front-end JavaScript code needs to calculate the exact length of the final byte array produced by the Java code. Commented Jun 11, 2014 at 0:34
  • The back-end should calculate the length then. The front-end can guess at the length, but it is the back-end which is responsible and the authoritative source (and it should be understood that the length itself is not necessarily canonical, but merely the result of the current operation). Commented Jun 11, 2014 at 0:41
  • Unfortunately the UI can't afford a backend call to do that, it needs to provide user feedback right away when the characters are typed in. Commented Jun 11, 2014 at 1:07

1 Answer


There are two different things happening: Unicode encoding and JSON string escaping.

Per section 2.5, "Strings", of the JSON RFC (RFC 4627):

.. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped ..

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence .. [and characters outside the BMP are escaped as UTF-16 encoded surrogate pairs]

That is, the JSON strings "•é" and "\u2022é" are equivalent. Which (additional) characters to escape is entirely up to the serialization implementation, and both forms are valid.

It is this JSON string (which is Unicode text) that gets encoded when converted to a byte stream; in the example it is encoded as UTF-8. Two JSON texts may thus represent equivalent strings without being byte-equivalent at the stream level or character-equivalent at the JSON text level.
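To see this concretely, here is a small sketch (using the standard TextEncoder API) of two equivalent JSON texts that parse to the same string but have different UTF-8 byte lengths:

```javascript
// Two JSON texts for the same string value: one escapes the bullet, one doesn't.
const escapedForm = '"\\u2022é"'; // JSON text: "\u2022é"
const rawForm     = '"•é"';       // JSON text: "•é"

// They parse to the same JavaScript string...
console.log(JSON.parse(escapedForm) === JSON.parse(rawForm)); // true

// ...but their UTF-8 encodings differ in length.
const enc = new TextEncoder();
console.log(enc.encode(escapedForm).length); // 10 bytes
console.log(enc.encode(rawForm).length);     // 7 bytes
```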


As for the rules JSONObject follows, it escapes a character c when

    c < ' '
    || (c >= '\u0080' && c < '\u00a0')
    || (c >= '\u2000' && c < '\u2100')

One reason these characters, in the range [\u2000, \u2100], may be escaped is to ensure the resulting JSON is also valid JavaScript. The article JSON: The JavaScript subset that isn't discusses the issue: the problem is the Unicode code-points \u2028 and \u2029 are treated as line terminators in JavaScript string literals, but not JSON. (There are other Unicode Separator characters in the range: might as well catch them in one go.)
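A front-end that must match those bytes can replicate the rule above. The following is a sketch, not a standard API: it covers the standard JSON escapes plus JSONObject's extra ranges (org.json's quote() has a few further special cases, e.g. escaping / after <, which are omitted here):

```javascript
// Sketch: escape a string the way org.json's JSONObject does (simplified).
function escapeLikeJSONObject(s) {
  const simple = { '"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
                   '\n': '\\n', '\r': '\\r', '\t': '\\t' };
  let out = '';
  for (const ch of s) {
    const c = ch.codePointAt(0);
    if (simple[ch]) {
      out += simple[ch];
    } else if (c < 0x20 ||                    // control characters
               (c >= 0x80 && c < 0xa0) ||     // C1 controls
               (c >= 0x2000 && c < 0x2100)) { // block containing \u2028/\u2029
      out += '\\u' + c.toString(16).padStart(4, '0');
    } else {
      out += ch;
    }
  }
  return out;
}

const json = '{"alert":"' + escapeLikeJSONObject('•é') + '"}';
// json is '{"alert":"\u2022é"}' — the same 20 UTF-8 bytes as the Java output.
```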


5 Comments

I understand, but in this case the JavaScript needs to know exactly what the byte array will look like in order to get the correct length (the same length the Java code will use later). Thus being "equivalent" is not enough; the front-end JS code needs to escape the JSON string exactly the same way as the Java code.
That is not a good idea (in fact, I have a mind to say it's a terrible idea). In any case I've updated the answer to include the rules used with JSONObject. You'll have to write a custom function to perform similar escaping (such escaping is not guaranteed to be followed in any particular JSON.stringify implementation), and then create a function to UTF-8 encode or UTF-8-encoded-length-guess the result - the length in byte counting can be done merely by looking at the code point magnitudes. You'll also need to deal with whitespace between JSON tokens.
@user3277841 Why on Earth would the Javascript need to know what length byte[] the Java code would use? Isn't the whole point of JSON to have a nice, neat string format to pass around, and not have to worry about niggling bit-twiddling details like this?
because the UI needs to check the length of the string while it is being typed into a text box by the user, and the length of the string is decided by that java code in the original post.
@user3277841 The length of the string should probably be the logical characters (not the encoded length) for UI purposes. Consider adding in "slack" for the backend (i.e. a larger varchar) if possible; one could also collapse the "\u...." JSON (via a substitution regex) in the Java to decrease the expansion difference.
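The byte counting "by looking at the code point magnitudes" mentioned above can be sketched as follows (utf8Length is an illustrative helper, not a built-in):

```javascript
// Sketch: UTF-8 byte length of a string, computed from code point magnitudes.
function utf8Length(s) {
  let bytes = 0;
  for (const ch of s) {               // for...of iterates by code point
    const c = ch.codePointAt(0);
    if (c < 0x80) bytes += 1;         // ASCII
    else if (c < 0x800) bytes += 2;   // e.g. é
    else if (c < 0x10000) bytes += 3; // rest of the BMP, e.g. •
    else bytes += 4;                  // outside the BMP
  }
  return bytes;
}

console.log(utf8Length('•é')); // 5 (3 + 2)
```

In environments with TextEncoder, `new TextEncoder().encode(s).length` gives the same result without hand-counting.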
