Java libraries to extract text blocks from HTML pages

Question

I want to extract text blocks from an HTML page, and I'm using boilerpipe to do this. It works fine when a page contains a single text, but some pages, such as blogs, contain multiple texts.

I want to extract all of the texts, with each one identified as a separate text rather than merged into one.

Is there a library that can do this?

EDIT: I'm using Jsoup to parse the HTML, but what I need is not parsing; it is information extraction like boilerpipe performs on pages. I want to test other similar tools.

Please provide more details. What extractor are you using? Have you tried ArticleExtractor? I tried using ArticleExtractor to fetch the content of a Stack Overflow post, and it extracted all the text for me. It would be easier for us to debug if you provided some sample code.
– Rajesh Pantula, Commented Jan 20, 2012 at 12:44

Santosh · Accepted Answer · 2012-01-20 15:47:37Z

3

Jsoup is a very widely used parser for these types of tasks. Please check it out.


bezmax · Accepted Answer · 2012-01-20 12:41:34Z

2

Well, personally I liked using Doj together with HtmlUnit. Basically, Doj introduces something similar to CSS selectors for Java.

Example (from the official page):

Doj spanDoj = Doj.on(page).get("#updates tr", 1).get("td", 2).get("span.item");

You can see a more complex example on the linked page (scroll down).


Lucas Wiman · Accepted Answer · 2012-01-20 19:19:32Z

The closest Java library I'm aware of is the Road Runner project (http://www.dia.uniroma3.it/db/roadRunner/). It's a system that constructs a special kind of regular expression over the tokens of an HTML document, which can (in many cases) detect patterns of this kind given several documents based on the same template. For blogs, this might be achieved by, for example, looking at paginated pages. You would probably still have to pick out precisely which repeated patterns were the ones of interest for each site.

For blogs, I would probably look for a feed link in the header of the blog and use a feed parsing library to parse out the permalinks for each article. Crawl those and use boilerpipe (only necessary because lots of blogs don't include the full text in the RSS/Atom feed). Lots of blogs don't include the full text on the main page either, so I'd focus on methods of identifying the permalinks, and go from there.
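
As a rough illustration of the feed step described above, here is a minimal sketch that pulls the permalink of each article out of an RSS 2.0 document using only the XML parser bundled with the JDK. The class name `FeedPermalinks` and the method `extractPermalinks` are hypothetical names chosen for this example, and the sketch assumes a plain RSS 2.0 layout (`<item>` elements with a `<link>` child). Atom feeds store the URL in an `href` attribute instead, and a dedicated feed parsing library would cover more formats. Fetching each permalink and running boilerpipe on it is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of boilerpipe or any feed library.
class FeedPermalinks {

    // Returns the text of the first <link> inside every RSS 2.0 <item>, in document order.
    // Items without a <link> are skipped.
    static List<String> extractPermalinks(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds come from untrusted sites, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            List<String> permalinks = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList links = item.getElementsByTagName("link");
                if (links.getLength() > 0) {
                    permalinks.add(links.item(0).getTextContent().trim());
                }
            }
            return permalinks;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed; a real one would be downloaded from the blog's feed URL.
        String sample =
            "<rss version=\"2.0\"><channel>"
          + "<title>Example blog</title><link>http://example.com/</link>"
          + "<item><title>First post</title><link>http://example.com/first</link></item>"
          + "<item><title>Second post</title><link>http://example.com/second</link></item>"
          + "</channel></rss>";
        for (String url : extractPermalinks(sample)) {
            System.out.println(url);
        }
    }
}
```

Each returned URL would then be crawled and passed to boilerpipe separately, so every article ends up as its own text block.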
