Extraction of text from HTML web pages using Java [duplicate]

Question

Possible Duplicate:
Text extraction with java html parsers

I'm new to Java and am trying to program an algorithm for web page classification. How can I extract text from HTML web pages in Java? A basic idea of what to do would be a great help.

Thanks, Archana

Also a possible duplicate of stackoverflow.com/questions/1386107/… & stackoverflow.com/questions/3036638/…
– Saurabh Gokhale, Commented Mar 12, 2011 at 15:13

Community · Accepted Answer · 2017-05-23 12:26:50Z

0

Once you have the raw HTML as a String, you can turn to an existing HTML parsing tool such as jsoup.

For a comparison, see: What are the pros and cons of the leading Java HTML parsers?

The jsoup cookbook also has a quick example of what you can extract from an HTML page using CSS selectors: http://jsoup.org/cookbook/extracting-data/example-list-links
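As a rough illustration of the jsoup approach, here is a minimal sketch that assumes the jsoup library is on the classpath and that the HTML is already in a String. The class name `JsoupTextExample`, the method name `visibleText`, and the sample markup are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; requires the jsoup library on the classpath.
final class JsoupTextExample {

    // Returns the text of the <body> with all markup removed.
    static String visibleText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <a href=\"http://example.com/\">world</a></p></body></html>";

        System.out.println(visibleText(html));

        // CSS selector: every anchor element that carries an href attribute.
        for (Element link : Jsoup.parse(html).select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```

If you want jsoup to download the page as well, `Jsoup.connect(url).get()` returns a `Document` you can query the same way.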

edited May 23, 2017 at 12:26

CommunityBot

answered Mar 12, 2011 at 15:10

Joey

1 Comment

user656673 Over a year ago

hey guys..thnx a lot 4 da suggestions..finally using jsoup and it works!!

bmargulies · Accepted Answer · 2012-08-19 21:54:22Z

0

I use Jericho to convert an HTML document to text. The code to get the text is pretty simple:

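    import net.htmlparser.jericho.Renderer;
    import net.htmlparser.jericho.Source;
    // html holds the raw HTML document as a String
    Source source = new Source(html);
    Renderer renderer = source.getRenderer();
    String text = renderer.toString();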

There are some options you can set on the renderer to adjust the text conversion, such as:

    renderer.setIncludeHyperlinkURLs(false);

edited Aug 19, 2012 at 21:54

bmargulies

answered May 16, 2011 at 13:59

Cooper

Mozart Brocchini · Accepted Answer · 2011-03-15 20:44:38Z

-1

@Codemwnci's answer helps you download the HTML page.

If you're looking for a way to separate HTML markup tags from content, you should use an HTML parser.
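To make the parser idea concrete without adding a dependency, here is a minimal sketch that uses the HTML parser shipped with the JDK (`javax.swing.text.html`). That choice is my assumption, since this answer does not name a parser, and a dedicated library will usually cope better with messy real-world pages. The class name `SwingHtmlText` and the method name `extract` are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is illustrative only.
final class SwingHtmlText {

    // Collects the text chunks reported by the parser, separated by single spaces.
    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory String.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The callback only overrides `handleText`, so tags and attributes are dropped and only the text between them is kept.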

edited Mar 15, 2011 at 20:44

answered Mar 12, 2011 at 15:11

Mozart Brocchini

2 Comments

Richard H Over a year ago

-1 for suggesting regular expressions to parse HTML.

Mozart Brocchini Over a year ago

@Richard, I agree that regular expressions will probably not be the best choice, but I also suggested using a parser. I actually edited my response and reordered the suggestions after your -1. I suggested regular expressions for cases like getting the text from only one particular HTML tag.
