How can I unescape HTML character entities in Java?

Question

Basically, I would like to decode a given HTML document and replace all special characters, such as "&nbsp;" → " " and "&gt;" → ">".

In .NET, we can make use of the HttpUtility.HtmlDecode method.

What's the equivalent function in Java?

&nbsp; is called a character entity. Edited the title.

– Eugene Yokota, Commented Jun 15, 2009 at 2:46

Vivien · Accepted Answer · 2019-08-30 09:48:04Z

227

I have used the Apache Commons StringEscapeUtils.unescapeHtml4() for this:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

edited Aug 30, 2019 at 9:48

Vivien

answered Jun 15, 2009 at 2:43

Kevin Hakanson

6 Comments

Sid Over a year ago

Sadly I just realized today that it does not decode HTML special characters very well :(

setzamora Over a year ago

a dirty trick is to store the value initially in a hidden field to escape it, then the target field should get the value from the hidden field.

Pauli Over a year ago

Class StringEscapeUtils is deprecated and moved to Apache commons-text

Nickkk Over a year ago

I want to convert the string <p>üè</p> to <p>üé</p>, with StringEscapeUtils.unescapeHtml4() I get <p>üè</p>. Is there a way to keep existing html tags intact?

ifly6 Over a year ago

If I have something like  which escapes to a quotation mark in Windows-1252 but some control character in Unicode, can the escaping encoding be changed?

Peter Mortensen · Accepted Answer · 2023-05-03 13:33:16Z

70

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML content in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();

// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

And you also get the convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It's open source and MIT License.

edited May 3, 2023 at 13:33

Peter Mortensen

answered May 17, 2016 at 13:25

Dale

6 Comments

SourceVisor Over a year ago

upvote+, but I should point out that newer versions of Jsoup use .text() instead of .getText()

danneu Over a year ago

Perhaps more direct is to use org.jsoup.parser.Parser.unescapeEntities(String string, boolean inAttribute). API docs: jsoup.org/apidocs/org/jsoup/parser/…

MandisaW Over a year ago

This was perfect, since I'm already using Jsoup in my project. Also, @danneu was right - Parser.unescapeEntities works exactly as advertised.

DavesPlanet Over a year ago

why then does the following not return un-escaped html: Parser.unescapeEntities(Jsoup.parse("<div>•</div>").text(), true)

DavesPlanet Over a year ago

@Dale see question 77405300 for details but what happens is I get a utf16 string with a bullet point out instead of the text

Peter Mortensen · Accepted Answer · 2023-05-03 14:16:22Z

I tried Apache Commons' StringEscapeUtils.unescapeHtml3() in my project, but I wasn't satisfied with its performance. It turns out, it does a lot of unnecessary operations. For one, it allocates a StringWriter for every call, even if there's nothing to unescape in the string. I've rewritten that code differently, and now it works much faster.

The following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3). You can just add more entries to the map if you need HTML 4.

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // Look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // Found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // Found escape
            if (input.charAt(i) == '#') {
                // Numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    if (writer == null)
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) {
                    i++;
                    continue;
                }
            }
            else {
                // Named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null)
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // Skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }

    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"},   // Non-breaking space
        {"\u00A1", "iexcl"},  // Inverted exclamation mark
        {"\u00A2", "cent"},   // Cent sign
        {"\u00A3", "pound"},  // Pound sign
        {"\u00A4", "curren"}, // Currency sign
        {"\u00A5", "yen"},    // Yen sign = yuan sign
        {"\u00A6", "brvbar"}, // Broken bar = broken vertical bar
        {"\u00A7", "sect"},   // Section sign
        {"\u00A8", "uml"},    // Diaeresis = spacing diaeresis
        {"\u00A9", "copy"},   // © - copyright sign
        {"\u00AA", "ordf"},   // Feminine ordinal indicator
        {"\u00AB", "laquo"},  // Left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"},    // Not sign
        {"\u00AD", "shy"},    // Soft hyphen = discretionary hyphen
        {"\u00AE", "reg"},    // ® - registered trademark sign
        {"\u00AF", "macr"},   // Macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"},    // Degree sign
        {"\u00B1", "plusmn"}, // Plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"},   // Superscript two = superscript digit two = squared
        {"\u00B3", "sup3"},   // Superscript three = superscript digit three = cubed
        {"\u00B4", "acute"},  // Acute accent = spacing acute
        {"\u00B5", "micro"},  // Micro sign
        {"\u00B6", "para"},   // Pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // Middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"},  // Cedilla = spacing cedilla
        {"\u00B9", "sup1"},   // Superscript one = superscript digit one
        {"\u00BA", "ordm"},   // Masculine ordinal indicator
        {"\u00BB", "raquo"},  // Right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // Vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // Vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // Vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // Inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // À - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Á - uppercase A, acute accent
        {"\u00C2", "Acirc"},  // Â - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Ã - uppercase A, tilde
        {"\u00C4", "Auml"},   // Ä - uppercase A, umlaut
        {"\u00C5", "Aring"},  // Å - uppercase A, ring
        {"\u00C6", "AElig"},  // Æ - uppercase AE
        {"\u00C7", "Ccedil"}, // Ç - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // È - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // É - uppercase E, acute accent
        {"\u00CA", "Ecirc"},  // Ê - uppercase E, circumflex accent
        {"\u00CB", "Euml"},   // Ë - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // Ì - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Í - uppercase I, acute accent
        {"\u00CE", "Icirc"},  // Î - uppercase I, circumflex accent
        {"\u00CF", "Iuml"},   // Ï - uppercase I, umlaut
        {"\u00D0", "ETH"},    // Ð - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // Ñ - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Ò - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // Ó - uppercase O, acute accent
        {"\u00D4", "Ocirc"},  // Ô - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Õ - uppercase O, tilde
        {"\u00D6", "Ouml"},   // Ö - uppercase O, umlaut
        {"\u00D7", "times"},  // Multiplication sign
        {"\u00D8", "Oslash"}, // Ø - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Ù - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ú - uppercase U, acute accent
        {"\u00DB", "Ucirc"},  // Û - uppercase U, circumflex accent
        {"\u00DC", "Uuml"},   // Ü - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Ý - uppercase Y, acute accent
        {"\u00DE", "THORN"},  // Þ - uppercase THORN, Icelandic
        {"\u00DF", "szlig"},  // ß - lowercase sharp s, German
        {"\u00E0", "agrave"}, // à - lowercase a, grave accent
        {"\u00E1", "aacute"}, // á - lowercase a, acute accent
        {"\u00E2", "acirc"},  // â - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // ã - lowercase a, tilde
        {"\u00E4", "auml"},   // ä - lowercase a, umlaut
        {"\u00E5", "aring"},  // å - lowercase a, ring
        {"\u00E6", "aelig"},  // æ - lowercase ae
        {"\u00E7", "ccedil"}, // ç - lowercase c, cedilla
        {"\u00E8", "egrave"}, // è - lowercase e, grave accent
        {"\u00E9", "eacute"}, // é - lowercase e, acute accent
        {"\u00EA", "ecirc"},  // ê - lowercase e, circumflex accent
        {"\u00EB", "euml"},   // ë - lowercase e, umlaut
        {"\u00EC", "igrave"}, // ì - lowercase i, grave accent
        {"\u00ED", "iacute"}, // í - lowercase i, acute accent
        {"\u00EE", "icirc"},  // î - lowercase i, circumflex accent
        {"\u00EF", "iuml"},   // ï - lowercase i, umlaut
        {"\u00F0", "eth"},    // ð - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // ñ - lowercase n, tilde
        {"\u00F2", "ograve"}, // ò - lowercase o, grave accent
        {"\u00F3", "oacute"}, // ó - lowercase o, acute accent
        {"\u00F4", "ocirc"},  // ô - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // õ - lowercase o, tilde
        {"\u00F6", "ouml"},   // ö - lowercase o, umlaut
        {"\u00F7", "divide"}, // Division sign
        {"\u00F8", "oslash"}, // ø - lowercase o, slash
        {"\u00F9", "ugrave"}, // ù - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ú - lowercase u, acute accent
        {"\u00FB", "ucirc"},  // û - lowercase u, circumflex accent
        {"\u00FC", "uuml"},   // ü - lowercase u, umlaut
        {"\u00FD", "yacute"}, // ý - lowercase y, acute accent
        {"\u00FE", "thorn"},  // þ - lowercase thorn, Icelandic
        {"\u00FF", "yuml"},   // ÿ - lowercase y, umlaut
    };

    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES)
            lookupMap.put(seq[1].toString(), seq[0]);
    }

}

Recently, I had to optimize a slow Struts project. It turned out that, under the covers, Struts calls Apache for HTML string escaping by default (<s:property value="..."/>). Turning off escaping (<s:property value="..." escaping="false"/>) got some pages to run 5% to 20% faster.
A StringWriter uses a StringBuffer internally which uses locking. Using a StringBuilder directly should be faster.
found a bug in the above code when encountering "=" aka =. writer.write(entityValue); should be writer.write(Character.toString((char)entityValue)); – Stevko 4 hours ago
@NickFrolov, your comments seem a bit messed up. auml is for instance ä and not д.
Improved version with all HTML5 characters: gist.github.com/MarkJeronimus/798c452582e64410db769933ec71cfb7
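
Following up on the StringBuilder suggestion in the comments above, here is a minimal sketch of decoding only the numeric escapes with a StringBuilder. It is not the answer author's code: the class name NumericEntityDecoder and the method decodeNumeric are made up for illustration, and it trades the hand-written scanning loop for a regex. It uses appendCodePoint, so values above 0xFFFF come out as a surrogate pair without special handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class NumericEntityDecoder {
    // Matches decimal (&#8211;) and hexadecimal (&#x2013;) character references.
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));");

    static String decodeNumeric(String input) {
        Matcher m = NUMERIC.matcher(input);
        StringBuilder out = null; // allocated lazily, only if something is decoded
        int last = 0;             // end of the last reference that was replaced
        while (m.find()) {
            int codePoint;
            try {
                codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1))
                        : Integer.parseInt(m.group(2), 16);
            } catch (NumberFormatException ex) {
                continue; // too many digits for an int: leave the text untouched
            }
            if (!Character.isValidCodePoint(codePoint)) {
                continue; // beyond U+10FFFF: leave the text untouched
            }
            if (out == null) {
                out = new StringBuilder(input.length());
            }
            out.append(input, last, m.start());
            out.appendCodePoint(codePoint); // handles supplementary characters too
            last = m.end();
        }
        if (out == null) {
            return input; // nothing to unescape, no allocation
        }
        return out.append(input, last, input.length()).toString();
    }
}
```

Like the answer's code, this returns the original string instance when there is nothing to unescape.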

Herman Bovens · Accepted Answer · 2020-05-14 09:10:44Z

21

Spring Framework HtmlUtils

If you're already using the Spring Framework, use the following method:

import static org.springframework.web.util.HtmlUtils.htmlUnescape;

...

String result = htmlUnescape(source);

answered May 14, 2020 at 9:10

Herman Bovens

Stephan · Accepted Answer · 2016-07-27 12:02:45Z

17

The following library can also be used for HTML escaping in Java: unbescape.

HTML can be unescaped this way:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

edited Jul 27, 2016 at 12:02

answered Jul 13, 2014 at 22:59

Stephan

2 Comments

user1191027 Over a year ago

It did nothing to this:

%3Chtml%3E%0D%0A%3Chead%3E%0D%0A%3Ctitle%3Etest%3C%2Ftitle%3E%0D%0A%3C%2Fhead%3E%0D%0A%3Cbody%3E%0D%0Atest%0D%0A%3C%2Fbody%3E%0D%0A%3C%2Fhtml%3E

Mikhail Batcer Over a year ago

@ThreaT Your text is not html-encoded, it is url-encoded.
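
For URL-encoded (percent-encoded) input like the one in the comment above, the standard library's java.net.URLDecoder is probably the tool to reach for, not an HTML unescaper. A minimal sketch, assuming the text was encoded as UTF-8 and that you are on Java 10 or later (the Charset overload does not exist in older versions); the class and method names are hypothetical:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper, shown only to illustrate the call.
class UrlDecodeExample {
    static String decodeUrl(String encoded) {
        // Assumption: the input was percent-encoded as UTF-8.
        // On Java 9 and older, use URLDecoder.decode(encoded, "UTF-8") instead.
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

Note that URLDecoder also turns "+" into a space, since it targets the application/x-www-form-urlencoded format.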

Peter Mortensen · Accepted Answer · 2023-05-03 14:21:34Z

12

This did the job for me,

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml(encodedXML);

Or

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML = StringEscapeUtils.unescapeHtml4(encodedXML);

I guess it's better to use lang3, since it is the newer line of the library.

edited May 3, 2023 at 14:21

Peter Mortensen

answered Apr 19, 2017 at 2:31

tk_

Peter Mortensen · Accepted Answer · 2023-05-03 14:22:53Z

4

A very simple, but inefficient solution without any external library is:

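import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static String unescapeHtml3(String str) {
    try {
        HTMLDocument doc = new HTMLDocument();
        // Let Swing's HTML parser decode the entities while reading the fragment
        new HTMLEditorKit().read(new StringReader("<html><body>" + str), doc, 0);
        return doc.getText(1, doc.getLength());
    } catch (Exception ex) {
        // Fall back to the raw input if parsing fails
        return str;
    }
}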

Use this only if you have a small number of strings to decode.

edited May 3, 2023 at 14:22

Peter Mortensen

answered Dec 3, 2016 at 22:07

Horcrux7

1 Comment

Greg Over a year ago

Very close, but not exact - it converted "qwAS12ƷƸǅǚǪǼȌ" to "qwAS12ƷƸǅǚǪǼȌ\n".

Floern · Accepted Answer · 2017-09-12 21:43:21Z

3

The most reliable way is with

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils.

And to remove leading and trailing whitespace:

cleanedString = cleanedString.trim();

This ensures that stray whitespace from copying and pasting into web forms does not get persisted in the DB.
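
One caveat worth checking in your own setup: String.trim() only strips characters up to U+0020, so a non-breaking space (U+00A0, which is what &nbsp; unescapes to) survives it. A minimal sketch of one way around that; the class and method names are hypothetical, and replacing every non-breaking space with a plain space is an assumption that may not suit all data:

```java
// Hypothetical helper, not part of Apache Commons.
class WhitespaceCleanup {
    static String trimIncludingNbsp(String s) {
        // Turn non-breaking spaces into ordinary spaces first, then trim as usual.
        return s.replace('\u00A0', ' ').trim();
    }
}
```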

edited Sep 12, 2017 at 21:43

Floern

answered Sep 12, 2017 at 21:16

mike oganyan

Pramod H G · Accepted Answer · 2021-09-09 12:07:23Z

1

StringEscapeUtils (Apache Commons Lang)
Escapes and unescapes Strings for Java, JavaScript, HTML, and XML.

import org.apache.commons.lang.StringEscapeUtils;
....
StringEscapeUtils.unescapeHtml(comment);

Reference: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

answered Sep 9, 2021 at 12:07

Pramod H G

Peter Mortensen · Accepted Answer · 2023-05-03 13:45:30Z

0

Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils as suggested by Kevin Hakanson did not work 100% for me; several entities, like &#145 (left single quote) were translated into '222' somehow. I also tried org.jsoup, and had the same problem.

edited May 3, 2023 at 13:45

Peter Mortensen

answered Jun 3, 2014 at 23:25

Joost

1 Comment

Peter Mortensen Over a year ago

222 is likely in octal (hexadecimal 0x92, decimal 146). In Windows-1252 (but not in ISO 8859-1), 0x92 corresponds to U+2019 (RIGHT SINGLE QUOTATION MARK). Are you sure it is not octal 221? Or right single quote?
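
If the input really does contain numeric references in the 128 to 159 range that were meant as Windows-1252 bytes (browsers generally interpret them that way), one option is to remap the value after parsing it. This is only a sketch: the class and method names are hypothetical, and it assumes the JDK ships the windows-1252 charset, which is not one of the charsets Java guarantees.

```java
import java.nio.charset.Charset;

// Hypothetical helper; not what Commons or jsoup do internally.
class Cp1252NumericReference {
    // Assumption: this charset is available on the running JDK.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Maps values 0x80..0x9F through Windows-1252; returns other values unchanged. */
    static int remap(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint;
        }
        char c = new String(new byte[] {(byte) codePoint}, CP1252).charAt(0);
        // Bytes that Windows-1252 leaves undefined may decode to U+FFFD; keep the input then.
        return c == '\uFFFD' ? codePoint : c;
    }
}
```

With this, 145 (0x91) maps to U+2018 (LEFT SINGLE QUOTATION MARK) and 146 (0x92) to U+2019.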

Peter Mortensen · Accepted Answer · 2023-05-03 14:19:32Z

0

In my case, I use the replace method by testing every entity in every variable. My code looks like this:

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

In my case this worked very well.

edited May 3, 2023 at 14:19

Peter Mortensen

answered Mar 26, 2015 at 12:51

Luiz dev

2 Comments

Sandy Gifford Over a year ago

This isn't every special entity. Even the two mentioned in the question are missing.

denov Over a year ago

this will not scale well
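
To illustrate the scaling concern: each replace call scans the whole string again, so the cost grows with the number of entities. A table-driven single pass avoids that. The sketch below is an illustration only, with a hypothetical class name and a deliberately tiny table; a real table would need every entity you expect to see.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the table is a small sample, not a complete entity list.
class NamedEntityDecoder {
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("nbsp", "\u00A0");
        ENTITIES.put("Ccedil", "\u00C7");
        ENTITIES.put("eacute", "\u00E9");
    }

    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String decodeNamed(String text) {
        Matcher m = NAMED.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0; // end of the last entity that was replaced
        while (m.find()) {
            String value = ENTITIES.get(m.group(1));
            if (value == null) {
                continue; // unknown name: leave it in the output as-is
            }
            out.append(text, last, m.start()).append(value);
            last = m.end();
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

Because the input is scanned once, already-decoded output is never decoded again: "&amp;lt;" becomes "&lt;", not "<".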

Peter Mortensen · Accepted Answer · 2023-05-03 13:42:18Z

-7

In case you want to mimic what PHP function htmlspecialchars_decode() does, use PHP function get_html_translation_table() to dump the table and then use the Java code like,

import java.util.LinkedHashMap;
import java.util.Map;

// Insertion-ordered so that "&amp;" is decoded last; decoding it first
// would turn "&amp;lt;" into "&lt;" and then into "<" (decoded twice).
static final Map<String, String> html_specialchars_table = new LinkedHashMap<String, String>();

static {
    html_specialchars_table.put("&lt;", "<");
    html_specialchars_table.put("&gt;", ">");
    html_specialchars_table.put("&amp;", "&");
}

static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
    for (Map.Entry<String, String> entry : html_specialchars_table.entrySet()) {
        // replace() matches literally; replaceAll() would treat the key as a regex
        s = s.replace(entry.getKey(), entry.getValue());
    }
    return s;
}

edited May 3, 2023 at 13:42

Peter Mortensen

answered Apr 18, 2012 at 6:52

Bala Dutt
