8

Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

I need to extract paragraphs (like the title of a Stack Overflow question) from an HTML file.

I can use regular expressions in Java to extract the fields I need, but I then have to decode the extracted fields.

EXAMPLE

field extracted:

Paging Lucene&#39;s search results

field after decoding:

Paging Lucene's search results

Is there a class in Java that will let me decode these HTML entities?

3
  • Does your HTML contain tags? Commented Dec 6, 2012 at 18:43
  • Yes, but the field extracted doesn't contain tags Commented Dec 6, 2012 at 18:44
  • For starters, using regex to parse HTML is utterly wrong in the first place. Just use an HTML parser like Jsoup. Any decent one will already unescape HTML entities for you. Commented Dec 6, 2012 at 18:47

2 Answers 2

31

Use methods provided by Apache Commons Lang

import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);

3

Do not try to solve everything by regexp.

While you can do some parts with regular expressions, such as replacing entities, the much better approach is to use a (robust) HTML parser.

See the question "RegEx match open tags except XHTML self-contained tags" for why doing this with the regexp Swiss Army chainsaw is a bad idea. Seriously, read that question and its top answer; it is a Stack Overflow highlight!

Chuck Norris can parse HTML with regex.

The bad news is: there is more than one way to encode characters.

https://en.wikipedia.org/wiki/Character_encodings_in_HTML

For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;
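To make the three forms concrete, here is a minimal JDK-only sketch that resolves named, decimal and hexadecimal references to the same code point. The class name EntityForms, the method decode and the tiny NAMED table are hypothetical and exist only for illustration; real HTML defines far more named references and several error-recovery rules that this sketch ignores, so it is not a substitute for a library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityForms {
    // Hypothetical, deliberately tiny table of named references (illustration only).
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("lambda", 0x3BB);
        NAMED.put("amp", 0x26);
        NAMED.put("lt", 0x3C);
        NAMED.put("gt", 0x3E);
        NAMED.put("quot", 0x22);
    }

    // Matches &#xHEX; or &#DEC; or &name; (the terminating semicolon is required here).
    private static final Pattern REF =
        Pattern.compile("&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    public static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint;
            try {
                if (m.group(1) != null) {
                    codePoint = Integer.parseInt(m.group(1), 16); // hexadecimal form
                } else if (m.group(2) != null) {
                    codePoint = Integer.parseInt(m.group(2), 10); // decimal form
                } else {
                    codePoint = NAMED.get(m.group(3));            // named form, null if unknown
                }
            } catch (NumberFormatException e) {
                codePoint = null; // number too large to be a code point
            }
            // Unknown or invalid references are left in the text unchanged.
            String replacement = (codePoint != null && Character.isValidCodePoint(codePoint))
                ? new String(Character.toChars(codePoint))
                : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // All three spellings resolve to the same character.
        System.out.println(decode("&lambda; &#955; &#x3BB;"));
        System.out.println(decode("Paging Lucene&#39;s search results"));
    }
}
```

Even this small sketch needs three branches plus a lookup table, and it still leaves malformed input such as a reference without its semicolon untouched.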

And if you are really unlucky, some web site relies on a browser's ability to guess what a character means. &#153;, for example, is not valid, yet many browsers will interpret it as ™.

Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.

So I strongly recommend:

  • Feed string into a robust HTML parser
  • Get parsed (and fully decoded) string back

9 Comments

I need to extract from HTML pages with the same structure and tags (like Wikipedia). So I think regex is a good approach.
@MrCarAsus: NO IT IS NOT. Use an HTML parser, and DOM for extraction. That is what they are for!
Try using DBPedia, btw. It is an already parsed version of Wikipedia.
And do you know a parsed version of StackOverflow? I tried using regex with StackOverflow HTML pages and it works. I extract the title and answers with a set of regexps applied to the HTML.
@MikeSamuel The page says in number 3: "not ... in the range U+0080–U+009F". 0x0099 is in this range.