ASCII to HTML-Entities Escaping in Java

Question

I found this website with escape codes, and I'm wondering whether someone has already done this so I don't have to spend a couple of hours building the logic myself:

 StringBuffer sb = new StringBuffer();
 int n = s.length();
 for (int i = 0; i < n; i++) {
     char c = s.charAt(i);
     switch (c) {
         case '\u25CF': sb.append("&#9679;"); break;
         case '\u25BA': sb.append("&#9658;"); break;

         /*
         ... the rest of the hex chars literals to HTML entities
         */  

         default:  sb.append(c); break;
     }
 }

Do you want the exact same value, or do you need to have some values converted to something else?
– Thorbjørn Ravn Andersen, Commented Mar 26, 2011 at 8:27
@Mat Banik - re: the results; are you sure you don't have a transcoding error at the compilation stage? See here: illegalargumentexception.blogspot.com/2009/05/…
– McDowell, Commented Mar 26, 2011 at 15:36

Pawel Veselov · Accepted Answer · 2011-03-27 04:09:35Z

3

These "codes" are merely the decimal representation of the Unicode value of the actual character. It seems to me that something like this would work, unless you want to be very strict about which codes get converted and which don't.

StringBuilder sb = new StringBuilder();
int n = s.length();
for (int i = 0; i < n; i++) {
    char c = s.charAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append((int) c);
        sb.append(';');
    } else {
        sb.append(c);
    }
}
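To make the "decimal representation" point concrete, here is a minimal sketch that prints the entity for the two characters from the question. The class and method names (DecimalEntityDemo, toDecimalEntity) are made up for illustration, and it only covers characters that fit in a single char:

```java
final class DecimalEntityDemo {
    // Hypothetical helper: casting a char to int yields its UTF-16 code unit,
    // which for BMP characters is the number used in the decimal entity.
    static String toDecimalEntity(char c) {
        return "&#" + (int) c + ";";
    }

    public static void main(String[] args) {
        System.out.println(toDecimalEntity('\u25CF')); // &#9679;
        System.out.println(toDecimalEntity('\u25BA')); // &#9658;
    }
}
```

Both outputs match the hand-written cases in the question's switch statement, so the lookup table should not be needed for these characters.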

edited Mar 27, 2011 at 4:09

answered Mar 26, 2011 at 7:03

2 Comments

Paŭlo Ebermann Over a year ago

You should take care of surrogate pairs, too. (Which means iterating over code points, not code units.)

robinst Over a year ago

As Paŭlo mentioned, this code is broken for surrogate pairs (e.g. emojis). See my answer for handling them correctly.

robinst · Accepted Answer · 2016-05-05 01:36:03Z

The other answers don't work correctly for surrogate pairs, e.g. if you have emoji such as "😀" (U+1F600). Here's how to do it in Java 8:

StringBuilder sb = new StringBuilder();
s.codePoints().forEach(codePoint -> {
    if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(codePoint);
        sb.append(';');
    } else {
        sb.appendCodePoint(codePoint);
    }
});

And for older Java:

StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int c = s.codePointAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(c);
        sb.append(';');
    } else {
        sb.appendCodePoint(c);
    }
    i += Character.charCount(c);
}

A simple way to test if a solution handles surrogate pairs correctly is to use "\uD83D\uDE00" (😀) as the input. If the output is "&#55357;&#56832;", then it's wrong. The correct output is "&#128512;".
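That check can be run as a small program. The sketch below puts a per-char loop next to the code point loop; the class and method names (SurrogatePairCheck, escapeByChar, escapeByCodePoint) are assumptions for illustration, and the per-char variant uses the simple "c > 127" test rather than UnicodeBlock:

```java
final class SurrogatePairCheck {
    // Per-char loop: each half of a surrogate pair is escaped on its own.
    static String escapeByChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Code point loop: a surrogate pair is read as one code point.
    static String escapeByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String grinning = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(escapeByChar(grinning));      // &#55357;&#56832;
        System.out.println(escapeByCodePoint(grinning)); // &#128512;
    }
}
```

Writing the input with escape sequences instead of a literal emoji keeps the test independent of the source file encoding.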

WhiteFang34 · Accepted Answer · 2011-03-26 11:55:49Z

0

Hmm, what if you did something like this instead:

if (c > 127) {
    sb.append("&#" + (int) c + ";");
} else {
    sb.append(c);
}

Then you just need to determine the range of characters you want HTML-escaped. In this case I just specified any character beyond the ASCII range.
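One way to keep the choice of range separate from the loop is to pass it in as a predicate. This is only a sketch: RangeEscaper, NON_ASCII and NON_ASCII_OR_MARKUP are hypothetical names, it needs Java 8 for IntPredicate and codePoints(), and it iterates over code points rather than chars so that surrogate pairs are handled as well:

```java
import java.util.function.IntPredicate;

final class RangeEscaper {
    // Same rule as the snippet above: everything beyond 7-bit ASCII.
    static final IntPredicate NON_ASCII = cp -> cp > 127;

    // Assumed stricter variant: also escape characters with meaning in HTML markup.
    static final IntPredicate NON_ASCII_OR_MARKUP =
            cp -> cp > 127 || cp == '&' || cp == '<' || cp == '>' || cp == '"';

    // Escapes every code point the predicate selects as a decimal entity.
    static String escape(String s, IntPredicate shouldEscape) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (shouldEscape.test(cp)) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "a<\u00E9";
        System.out.println(escape(input, NON_ASCII));           // a<&#233;
        System.out.println(escape(input, NON_ASCII_OR_MARKUP)); // a&#60;&#233;
    }
}
```

Note that the plain "beyond ASCII" rule leaves &, < and > untouched, since they are ASCII characters themselves. Whether those should be escaped too depends on where the output ends up.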

edited Mar 26, 2011 at 11:55

answered Mar 26, 2011 at 7:03

2 Comments

WhiteFang34 Over a year ago

Looks like Pawel has a more complete answer.

McDowell Over a year ago

255 is too high for ASCII - it's only 7-bit so you'd want 127.
