193

Basically, I would like to decode a given HTML document, and replace all special characters, such as " "" " and ">"">".

In .NET, we can make use of the HttpUtility.HtmlDecode method.

What's the equivalent function in Java?

1
  • 5
      is called character entity. Edited the title. Commented Jun 15, 2009 at 2:46

12 Answers 12

227

I have used the Apache Commons StringEscapeUtils.unescapeHtml4() for this:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

Sign up to request clarification or add additional context in comments.

6 Comments

Sadly I just realized today that it does not decode HTMLspecial characters very well :(
a dirty trick is to store the value initially in a hidden field to escape it, then the target field should get the value from the hidden field.
Class StringEscapeUtils is deprecated and moved to Apache commons-text
I want to convert the string <p>&uuml;&egrave;</p> to <p>üé</p>, with StringEscapeUtils.unescapeHtml4() I get &lt;p&gt;üè&lt;/p&gt;. Is there a way to keep existing html tags intact?
If I have something like &#147; which escapes to a quotation mark in Windows-1252 but some control character in Unicode, can the escaping encoding be changed?
|
70

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML content in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();

// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

And you also get the convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It's open source and MIT License.

6 Comments

upvote+, but I should point that newer versions of Jsoup use .text() instead of .getText()
Perhaps more direct is to use org.jsoup.parser.Parser.unescapeEntities(String string, boolean inAttribute). API docs: jsoup.org/apidocs/org/jsoup/parser/…
This was perfect, since I'm already using Jsoup in my project. Also, @danneu was right - Parser.unescapeEntities works exactly as advertised.
why then does the following not return un-escaped html: Parser.unescapeEntities(Jsoup.parse("<div>&#8226;</div>").text(), true)
@Dale see question 77405300 for details but what happens is I get a utf16 string with a bullet point out instead of the text
|
46

I tried Apache Commons' StringEscapeUtils.unescapeHtml3() in my project, but I wasn't satisfied with its performance. It turns out, it does a lot of unnecessary operations. For one, it allocates a StringWriter for every call, even if there's nothing to unescape in the string. I've rewritten that code differently, and now it works much faster.

The following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3). You can just add more entries to the map if you need HTML 4.

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // Look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // Found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // Found escape
            if (input.charAt(i) == '#') {
                // Numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    if (writer == null)
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) {
                    i++;
                    continue;
                }
            }
            else {
                // Named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null)
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // Skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }

    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"},   // Non-breaking space
        {"\u00A1", "iexcl"},  // Inverted exclamation mark
        {"\u00A2", "cent"},   // Cent sign
        {"\u00A3", "pound"},  // Pound sign
        {"\u00A4", "curren"}, // Currency sign
        {"\u00A5", "yen"},    // Yen sign = yuan sign
        {"\u00A6", "brvbar"}, // Broken bar = broken vertical bar
        {"\u00A7", "sect"},   // Section sign
        {"\u00A8", "uml"},    // Diaeresis = spacing diaeresis
        {"\u00A9", "copy"},   // © - copyright sign
        {"\u00AA", "ordf"},   // Feminine ordinal indicator
        {"\u00AB", "laquo"},  // Left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"},    // Not sign
        {"\u00AD", "shy"},    // Soft hyphen = discretionary hyphen
        {"\u00AE", "reg"},    // ® - registered trademark sign
        {"\u00AF", "macr"},   // Macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"},    // Degree sign
        {"\u00B1", "plusmn"}, // Plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"},   // Superscript two = superscript digit two = squared
        {"\u00B3", "sup3"},   // Superscript three = superscript digit three = cubed
        {"\u00B4", "acute"},  // Acute accent = spacing acute
        {"\u00B5", "micro"},  // Micro sign
        {"\u00B6", "para"},   // Pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // Middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"},  // Cedilla = spacing cedilla
        {"\u00B9", "sup1"},   // Superscript one = superscript digit one
        {"\u00BA", "ordm"},   // Masculine ordinal indicator
        {"\u00BB", "raquo"},  // Right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // Vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // Vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // Vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // Inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // А - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Б - uppercase A, acute accent
        {"\u00C2", "Acirc"},  // В - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Г - uppercase A, tilde
        {"\u00C4", "Auml"},   // Д - uppercase A, umlaut
        {"\u00C5", "Aring"},  // Е - uppercase A, ring
        {"\u00C6", "AElig"},  // Ж - uppercase AE
        {"\u00C7", "Ccedil"}, // З - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // И - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // Й - uppercase E, acute accent
        {"\u00CA", "Ecirc"},  // К - uppercase E, circumflex accent
        {"\u00CB", "Euml"},   // Л - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // М - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Н - uppercase I, acute accent
        {"\u00CE", "Icirc"},  // О - uppercase I, circumflex accent
        {"\u00CF", "Iuml"},   // П - uppercase I, umlaut
        {"\u00D0", "ETH"},    // Р - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // С - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Т - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // У - uppercase O, acute accent
        {"\u00D4", "Ocirc"},  // Ф - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Х - uppercase O, tilde
        {"\u00D6", "Ouml"},   // Ц - uppercase O, umlaut
        {"\u00D7", "times"},  // Multiplication sign
        {"\u00D8", "Oslash"}, // Ш - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Щ - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ъ - uppercase U, acute accent
        {"\u00DB", "Ucirc"},  // Ы - uppercase U, circumflex accent
        {"\u00DC", "Uuml"},   // Ь - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Э - uppercase Y, acute accent
        {"\u00DE", "THORN"},  // Ю - uppercase THORN, Icelandic
        {"\u00DF", "szlig"},  // Я - lowercase sharps, German
        {"\u00E0", "agrave"}, // а - lowercase a, grave accent
        {"\u00E1", "aacute"}, // б - lowercase a, acute accent
        {"\u00E2", "acirc"},  // в - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // г - lowercase a, tilde
        {"\u00E4", "auml"},   // д - lowercase a, umlaut
        {"\u00E5", "aring"},  // е - lowercase a, ring
        {"\u00E6", "aelig"},  // ж - lowercase ae
        {"\u00E7", "ccedil"}, // з - lowercase c, cedilla
        {"\u00E8", "egrave"}, // и - lowercase e, grave accent
        {"\u00E9", "eacute"}, // й - lowercase e, acute accent
        {"\u00EA", "ecirc"},  // к - lowercase e, circumflex accent
        {"\u00EB", "euml"},   // л - lowercase e, umlaut
        {"\u00EC", "igrave"}, // м - lowercase i, grave accent
        {"\u00ED", "iacute"}, // н - lowercase i, acute accent
        {"\u00EE", "icirc"},  // о - lowercase i, circumflex accent
        {"\u00EF", "iuml"},   // п - lowercase i, umlaut
        {"\u00F0", "eth"},    // р - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // с - lowercase n, tilde
        {"\u00F2", "ograve"}, // т - lowercase o, grave accent
        {"\u00F3", "oacute"}, // у - lowercase o, acute accent
        {"\u00F4", "ocirc"},  // ф - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // х - lowercase o, tilde
        {"\u00F6", "ouml"},   // ц - lowercase o, umlaut
        {"\u00F7", "divide"}, // Division sign
        {"\u00F8", "oslash"}, // ш - lowercase o, slash
        {"\u00F9", "ugrave"}, // щ - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ъ - lowercase u, acute accent
        {"\u00FB", "ucirc"},  // ы - lowercase u, circumflex accent
        {"\u00FC", "uuml"},   // ь - lowercase u, umlaut
        {"\u00FD", "yacute"}, // э - lowercase y, acute accent
        {"\u00FE", "thorn"},  // ю - lowercase thorn, Icelandic
        {"\u00FF", "yuml"},   // я - lowercase y, umlaut
    };

    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES)
            lookupMap.put(seq[1].toString(), seq[0]);
    }

}

6 Comments

Recently, I had to optimize a slow Struts project. It turned out that under the cover Struts calls Apache for html string escaping by default (<s:property value="..."/>). Turning off escaping (<s:property value="..." escaping="false"/>) got some pages to run 5% to 20% faster.
A StringWriter uses a StringBuffer internally which uses locking. Using a StringBuilder directly should be faster.
found a bug in the above code when encountering "&#61;" aka =. writer.write(entityValue); should be writer.write(Character.toString((char)entityValue)); – Stevko 4 hours ago
@NickFrolov, your comments seem a bit messed up. auml is for instance ä and not д.
Improved version with all HTML5 characters: gist.github.com/MarkJeronimus/798c452582e64410db769933ec71cfb7
|
21

Spring Framework HtmlUtils

If you're using Spring framework already, use the following method:

import static org.springframework.web.util.HtmlUtils.htmlUnescape;

...

String result = htmlUnescape(source);

Comments

17

The following library can also be used for HTML escaping in Java: unbescape.

HTML can be unescaped this way:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText); 

2 Comments

It did nothing to this: %3Chtml%3E%0D%0A%3Chead%3E%0D%0A%3Ctitle%3Etest%3C%2Ftitle%3E%0D%0A%3C%2Fhead%3E%0D%0A%3Cbody%3E%0D%0Atest%0D%0A%3C%2Fbody%3E%0D%0A%3C%2Fhtml%3E
@ThreaT Your text is not html-encoded, it is url-encoded.
12

This did the job for me,

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml(encodedXML);

Or

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml4(encodedXML);

I guess it’s always better to use the lang3 for obvious reasons.

Comments

4

A very simple, but inefficient solution without any external library is:

public static String unescapeHtml3(String str) {
    try {
        HTMLDocument doc = new HTMLDocument();
        new HTMLEditorKit().read(new StringReader("<html><body>" + str), doc, 0);
        return doc.getText(1, doc.getLength());
    } catch(Exception ex) {
        return str;
    }
}

This should be used only if you have only small count of string to decode.

1 Comment

Very close, but not exact - it converted "qwAS12ƷƸDžǚǪǼȌ" to "qwAS12ƷƸDžǚǪǼȌ\n".
3

The most reliable way is with

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils.

And to escape the whitespaces

cleanedString = cleanedString.trim();

This will ensure that whitespaces due to copy and paste in web forms to not get persisted in DB.

Comments

1

StringEscapeUtils (Apache Commons Lang)
Escapes and unescapes Strings for Java, JavaScript, HTML, and XML.

import org.apache.commons.lang.StringEscapeUtils;
....
StringEscapeUtils.unescapeHtml(comment);

Reference: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

Comments

0

Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils as suggested by Kevin Hakanson did not work 100% for me; several entities, like &#145 (left single quote) were translated into '222' somehow. I also tried org.jsoup, and had the same problem.

1 Comment

222 is likely in octal (hexadecimal 0x92. decimal 146). In Windows-1252 (but not in ISO 8859-1), 0x92 corresponds to U+2019 (RIGHT SINGLE QUOTATION MARK). Are you sure it is not octal 221? Or right single quote?
0

In my case, I use the replace method by testing every entity in every variable. My code looks like this:

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

In my case this worked very well.

2 Comments

This isn't every special entity. Even the two mentioned in the question are missing.
this will not scale well
-7

In case you want to mimic what PHP function htmlspecialchars_decode() does, use PHP function get_html_translation_table() to dump the table and then use the Java code like,

static Map<String, String> html_specialchars_table = new Hashtable<String, String>();

static {
    html_specialchars_table.put("&lt;", "<");
    html_specialchars_table.put("&gt;", ">");
    html_specialchars_table.put("&amp;", "&");
}

static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
    Enumeration en = html_specialchars_table.keys();
    while(en.hasMoreElements()) {
        String key = en.nextElement();
        String val = html_specialchars_table.get(key);
        s = s.replaceAll(key, val);
    }
    return s;
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.