How to decode HTML codes using Java? [duplicate]

Question

Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

I need to extract paragraphs (such as the title of a Stack Overflow page) from an HTML file.

I can use regular expressions in Java to extract the fields I need, but I then have to decode the extracted fields.

EXAMPLE

field extracted:

Paging Lucene&#39;s search results

field after decoding:

Paging Lucene's search results

Is there a class in Java that can decode these HTML codes?

For starters, using regex to parse HTML is utterly wrong in the first place. Just use an HTML parser like Jsoup. Any halfway decent one will already unescape HTML for you.
– BalusC, Commented Dec 6, 2012 at 18:47

Manish Singh · Accepted Answer · 2013-08-18 14:42:20Z

31

Use the methods provided by Apache Commons Lang:

import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);

edited Aug 18, 2013 at 14:42

Manish Singh


answered Dec 6, 2012 at 18:41

jlordo


1 Comment

useranon Over a year ago

commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/… - Latest link

Community · Accepted Answer · 2017-05-23 11:47:35Z

3

Do not try to solve everything by regexp.

While you can do some parts with regexps, such as replacing entities, a much better approach is to use a robust HTML parser.

See the question "RegEx match open tags except XHTML self-contained tags" for why doing this with the regexp Swiss Army chainsaw is a bad idea. Seriously, read that question and its top answer; it is a Stack Overflow highlight!

Chuck Norris can parse HTML with regex.

The bad news is: there is more than one way to encode characters.

https://en.wikipedia.org/wiki/Character_encodings_in_HTML

For example, the character 'λ' can be represented as &lambda;, &#955; or &#X03bb;.
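To make that equivalence concrete, here is a minimal sketch in plain Java that decodes only numeric character references, in both decimal (&#955;) and hexadecimal (&#X03bb;) form. The class name NumericReferences and the method decode are made up for this illustration. Named entities such as &lambda; are deliberately left untouched because they need a lookup table, which is one more reason to prefer a library for real work.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: handles numeric references, not named entities.
final class NumericReferences {

    // Group 1 captures hexadecimal digits (&#x3bb; or &#X03BB;),
    // group 2 captures decimal digits (&#955;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(input, last, m.start());
            try {
                String hex = m.group(1);
                int codePoint = (hex != null)
                        ? Integer.parseInt(hex, 16)
                        : Integer.parseInt(m.group(2), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    out.appendCodePoint(codePoint);
                } else {
                    out.append(m.group()); // out of Unicode range: keep as written
                }
            } catch (NumberFormatException e) {
                out.append(m.group()); // number too large to parse: keep as written
            }
            last = m.end();
        }
        // Copy whatever follows the last match.
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

With this sketch, decode("&#955;") and decode("&#X03bb;") both yield λ, while &lambda; passes through unchanged.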

And if you are really unlucky, a website relies on the ability of some browsers to guess what a character means. The numeric reference &#153; (code point 0x0099), for example, is not valid, yet many browsers will interpret it as ™.
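That lenient behaviour can be sketched roughly as follows: a numeric reference whose code point falls in the range 0x80 to 0x9F is read as if it were a byte in the windows-1252 code page, where 0x99 is the trademark sign. This is only an illustration of the idea, not a description of any particular browser's code, and the class and method names are invented here.

```java
import java.nio.charset.Charset;

// Illustrative only: mimics the lenient reading of code points 0x80..0x9F.
final class LegacyNumericReference {

    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    static String interpretLeniently(int codePoint) {
        if (codePoint >= 0x80 && codePoint <= 0x9F) {
            // Treat the number as a single windows-1252 byte.
            // Bytes that windows-1252 leaves undefined decode to U+FFFD in Java.
            byte[] single = { (byte) codePoint };
            return new String(single, WINDOWS_1252);
        }
        // Everything else is taken as a regular Unicode code point.
        return new String(Character.toChars(codePoint));
    }
}
```

Here interpretLeniently(0x99) gives ™ (U+2122), whereas a strict decoder would produce the control character U+0099.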

Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.

So I strongly recommend:

Feed string into a robust HTML parser
Get parsed (and fully decoded) string back

edited May 23, 2017 at 11:47

CommunityBot


answered Dec 6, 2012 at 19:12

Has QUIT--Anony-Mousse


9 Comments

user Over a year ago

I need to extract data from HTML pages that share the same structure and tags (like Wikipedia), so I think regex is a good approach.

Has QUIT--Anony-Mousse Over a year ago

@MrCarAsus: NO IT IS NOT. Use an HTML parser, and DOM for extraction. That is what they are for!

Has QUIT--Anony-Mousse Over a year ago

Try using DBPedia, btw. It is an already parsed version of Wikipedia.

user Over a year ago

And do you know of a parsed version of Stack Overflow? I tried regex on Stack Overflow HTML pages and it works: I extract the title and answers with a set of regexps applied to the HTML.

Has QUIT--Anony-Mousse Over a year ago

@MikeSamuel The page says in number 3: "not ... in the range U+0080–U+009F". 0x0099 is in this range.
