
The Java code below

    JSONObject obj = new JSONObject();
    try {
        obj.put("alert", "•é");
        byte[] test = obj.toString().getBytes("UTF-8");
        // Arrays.toString prints the byte values; "" + test would only print the array reference
        logger.info("bytes are " + Arrays.toString(test));
    } catch (JSONException e) {
        e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

produces a JSONObject which escapes the bullet character but not the Latin small letter e with acute, e.g. "\u2022é". The resulting bytes are [123, 34, 97, 108, 101, 114, 116, 34, 58, 34, 92, 117, 50, 48, 50, 50, -61, -87, 34, 125].

How can I get the exact same output in JavaScript (in terms of byte sequence)? I don't understand why JSONObject escapes one character but not the other; I don't know what rule it follows.

It seems that in JavaScript I can only either escape everything outside ASCII (e.g. the range \u007f-\uffff) or escape nothing at all.
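That "escape everything outside ASCII" approach can be sketched as follows (the regex and post-processing here are illustrative, not a standard API; note it over-escapes the é relative to the Java output):

```javascript
// Sketch: escape every character above ASCII after stringifying.
// This over-escapes "é" (\u00e9) compared to JSONObject's output.
const escaped = JSON.stringify({ alert: '•é' }).replace(
  /[\u007f-\uffff]/g,
  ch => '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0')
);
// escaped is '{"alert":"\u2022\u00e9"}' (with literal backslash-u sequences)
```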

Thanks!

  • What is the purpose of creating a byte[] anyway? That's a different issue than the escaping shown. Commented Jun 11, 2014 at 0:23
  • Because the length of the byte array is used later in the backend; that is why the front-end JavaScript code needs to calculate the exact length of the final byte array produced by the Java code. Commented Jun 11, 2014 at 0:34
  • The back-end should calculate the length then. The front-end can guess at the length, but it is the back-end which is responsible and the authoritative source (and it should be understood that the length itself is not necessarily canonical, but merely the result of the current operation). Commented Jun 11, 2014 at 0:41
  • Unfortunately the UI can't afford a backend call to do that, it needs to provide user feedback right away when the characters are typed in. Commented Jun 11, 2014 at 1:07

1 Answer


There are two different things happening: Unicode encoding and JSON string escaping.

Per section 2.5, "Strings", of the JSON RFC (RFC 4627):

.. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped ..

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence .. [and characters outside the BMP are escaped as UTF-16 encoded surrogate pairs]

That is, the JSON strings "•é" and "\u2022é" are equivalent. Which (additional) characters to escape is entirely up to the serialization implementation, and both forms are valid.

It is this JSON string (which is Unicode text) that gets encoded when converted to a byte stream; in the example it is encoded as UTF-8. Two JSON texts may thus represent equivalent strings without being byte-equivalent at the stream level or character-equivalent at the JSON text level.
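To see this concretely, here is a small sketch (using the standard TextEncoder API) of two equivalent JSON texts that parse to the same string but have different UTF-8 byte lengths:

```javascript
// Two JSON texts for the same string value: one escapes the bullet, one doesn't.
const escapedForm = '"\\u2022é"'; // JSON text: "\u2022é"
const rawForm     = '"•é"';       // JSON text: "•é"

// They parse to the same JavaScript string...
console.log(JSON.parse(escapedForm) === JSON.parse(rawForm)); // true

// ...but their UTF-8 encodings differ in length.
const enc = new TextEncoder();
console.log(enc.encode(escapedForm).length); // 10 bytes
console.log(enc.encode(rawForm).length);     // 7 bytes
```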


As for the rules JSONObject follows, it escapes a character c when

    c < ' '
    || (c >= '\u0080' && c < '\u00a0')
    || (c >= '\u2000' && c < '\u2100')

One reason these characters, in the range [\u2000, \u2100], may be escaped is to ensure the resulting JSON is also valid JavaScript. The article JSON: The JavaScript subset that isn't discusses the issue: the problem is the Unicode code-points \u2028 and \u2029 are treated as line terminators in JavaScript string literals, but not JSON. (There are other Unicode Separator characters in the range: might as well catch them in one go.)
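A front-end that must match those bytes can replicate the rule above. The following is a sketch, not a standard API: it covers the standard JSON escapes plus JSONObject's extra ranges (org.json's quote() has a few further special cases, e.g. escaping / after <, which are omitted here):

```javascript
// Sketch: escape a string the way org.json's JSONObject does (simplified).
function escapeLikeJSONObject(s) {
  const simple = { '"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
                   '\n': '\\n', '\r': '\\r', '\t': '\\t' };
  let out = '';
  for (const ch of s) {
    const c = ch.codePointAt(0);
    if (simple[ch]) {
      out += simple[ch];
    } else if (c < 0x20 ||                    // control characters
               (c >= 0x80 && c < 0xa0) ||     // C1 controls
               (c >= 0x2000 && c < 0x2100)) { // block containing \u2028/\u2029
      out += '\\u' + c.toString(16).padStart(4, '0');
    } else {
      out += ch;
    }
  }
  return out;
}

const json = '{"alert":"' + escapeLikeJSONObject('•é') + '"}';
// json is '{"alert":"\u2022é"}' — the same 20 UTF-8 bytes as the Java output.
```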


5 Comments

I understand, but in this case the JavaScript needs to know exactly what the byte array will look like in order to get the correct length (the same length the Java code will use later). Thus being "equivalent" is not enough; the front-end JS code needs to escape the JSON string exactly the same way as the Java code.
That is not a good idea (in fact, I have a mind to say it's a terrible idea). In any case I've updated the answer to include the rules used with JSONObject. You'll have to write a custom function to perform similar escaping (such escaping is not guaranteed to be followed in any particular JSON.stringify implementation), and then create a function to UTF-8 encode or UTF-8-encoded-length-guess the result - the length in byte counting can be done merely by looking at the code point magnitudes. You'll also need to deal with whitespace between JSON tokens.
@user3277841 Why on Earth would the Javascript need to know what length byte[] the Java code would use? Isn't the whole point of JSON to have a nice, neat string format to pass around, and not have to worry about niggling bit-twiddling details like this?
because the UI needs to check the length of the string while it is being typed into a text box by the user, and the length of the string is decided by that java code in the original post.
@user3277841 The length of the string should probably be the logical characters (not the encoded length) for UI purposes. Consider adding in "slack" for the backend (i.e. a larger varchar) if possible; one could also collapse the "\u...." JSON (via a substitution regex) in the Java to decrease the expansion difference.
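The byte counting "by looking at the code point magnitudes" mentioned above can be sketched as follows (utf8Length is an illustrative helper, not a built-in):

```javascript
// Sketch: UTF-8 byte length of a string, computed from code point magnitudes.
function utf8Length(s) {
  let bytes = 0;
  for (const ch of s) {               // for...of iterates by code point
    const c = ch.codePointAt(0);
    if (c < 0x80) bytes += 1;         // ASCII
    else if (c < 0x800) bytes += 2;   // e.g. é
    else if (c < 0x10000) bytes += 3; // rest of the BMP, e.g. •
    else bytes += 4;                  // outside the BMP
  }
  return bytes;
}

console.log(utf8Length('•é')); // 5 (3 + 2)
```

In environments with TextEncoder, `new TextEncoder().encode(s).length` gives the same result without hand-counting.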
