I have some strings that contain XHTML character entities:

"They&apos;re quite varied"
"Sometimes the string &isin; XML standard, sometimes &isin; HTML4 standard"
"Therefore -&gt; I need an XHTML entity decoder."
"Sadly, some strings are not valid XML & are not-quite-so-valid HTML &lt;- but I want them to work, too."

Is there any easy way to decode the entities? (I'm using Java)

I'm currently using StringEscapeUtils.unescapeHtml4(myString.replace("&apos;", "\'")) as a temporary hack. Sadly, org.apache.commons.lang3.StringEscapeUtils has unescapeHtml4 and unescapeXml, but no unescapeXhtml.

EDIT: I do want to handle invalid XML, for example I want "&&xyzzy;" to decode to "&&xyzzy;"

EDIT: I think HTML5 has almost the same character entities as XHTML, so an HTML5 decoder would be fine too.

  • Aren't XHTML and HTML entities equivalent? Commented Feb 19, 2014 at 14:31
  • Hint: XHTML is valid XML. Commented Feb 19, 2014 at 14:32
  • @SotiriosDelimanolis: No. That's the problem. Commented Feb 19, 2014 at 14:37
  • @JanDvorak: If the input was guaranteed to be valid XHTML, then I'd be happy. Furthermore, XML by itself doesn't have all the HTML references. Commented Feb 19, 2014 at 14:38
  • Wikipedia says otherwise. Commented Feb 19, 2014 at 14:41

2 Answers

This may not be directly relevant, but you may wish to adopt JSoup, which handles things like this, albeit at a higher level. It also includes web page cleaning routines.

2 Comments

Thanks, looks great, but in my use case it would be overkill.
There is no such thing as overkill - only problems and solutions. JSoup is a solution and a far better one than doing manual search & replaces.

Have you tried implementing an XHTMLStringEscapeUtils based on the facilities provided by org.apache.commons.text.StringEscapeUtils?

import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.*;

public class XHTMLStringEscapeUtils {
    // An AggregateTranslator tries its translators in order and uses the
    // first one that consumes input, so the HTML 4 named entities win here.
    public static final CharSequenceTranslator ESCAPE_XHTML =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_ESCAPE),
                    new LookupTranslator(EntityArrays.ISO8859_1_ESCAPE),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE)
            // with(...) appends the XML 1.1 rules as a fallback,
            // which is where ' becomes &apos;.
            ).with(StringEscapeUtils.ESCAPE_XML11);

    // Same idea in reverse: HTML 4 named entities, then numeric references
    // (&#65; and &#x41;), then the XML-only &apos;.
    public static final CharSequenceTranslator UNESCAPE_XHTML =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
                    new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
                    new NumericEntityUnescaper(),
                    new LookupTranslator(EntityArrays.APOS_UNESCAPE)
            );

    public static String escape(final String input) {
        return ESCAPE_XHTML.translate(input);
    }

    public static String unescape(final String input) {
        return UNESCAPE_XHTML.translate(input);
    }
}

Thanks to the modular design of the Apache Commons Text library, it's easy to create custom escape utilities.

You can find a full project with tests here: xhtml-string-escape-utils
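
If adding a dependency is not an option, the lenient behaviour asked for in the question (decode what is recognised, leave everything else alone) can also be sketched with the JDK alone. This is only a rough illustration and not the code from the linked project: the class name LenientEntityDecoder, its decode method and the NAMED table are hypothetical, and the table is a deliberately tiny subset of the XHTML entities that you would have to fill in yourself. It assumes Java 9 or later (Map.of, and Matcher.appendReplacement with a StringBuilder).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of any library): lenient entity decoder. */
final class LenientEntityDecoder {
    // Assumption: illustrative subset only; a real table needs all XHTML entities.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'",
            "nbsp", "\u00A0", "isin", "\u2208");

    // Matches &name;, &#123; and &#x1F; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    private LenientEntityDecoder() { }

    /** Decodes known references in a single pass; anything else is kept as-is. */
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            // Unknown or invalid references are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Returns the decoded text, or null if the reference is not recognised. */
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint)) : null;
        } catch (NumberFormatException e) {
            return null; // too many digits for an int: treat as invalid
        }
    }
}
```

With this sketch, LenientEntityDecoder.decode("&&xyzzy;") returns the input unchanged, because a bare & and the unknown name xyzzy are both passed through. The trade-off against the Commons Text approach is that you maintain the entity table yourself.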
