I want to extract text blocks from an HTML page, and I'm using boilerpipe to do this. It works fine when a page contains a single text, but some pages, such as blogs, contain several texts.

I want to extract all of them, with each one identified as a separate text rather than merged into one.

Is there a library that can do this?

EDIT: I'm using Jsoup to parse the HTML, but I'm not looking for a parser. I want information extraction like boilerpipe performs on pages, and I'd like to try other, similar tools.

  • Please provide more details. What extractor are you using? Have you tried ArticleExtractor? I used ArticleExtractor to fetch the content of a Stack Overflow post and it extracted all the text for me. It would be easier for us to debug if you provided some sample code. Commented Jan 20, 2012 at 12:44
  • @rao_555 All the text as one text or multiple texts? Commented Jan 20, 2012 at 13:23

3 Answers

Jsoup is a very widely used parser for this type of task. Please check it out.

Well, personally I liked using Doj together with HtmlUnit. Basically, Doj introduces something similar to CSS selectors for Java.

Example (from the official page):

Doj spanDoj = Doj.on(page).get("#updates tr", 1).get("td", 2).get("span.item");

You can see a more complex example on the linked page (scroll down).

The closest Java library I'm aware of is the RoadRunner project: http://www.dia.uniroma3.it/db/roadRunner/ It's a system that, given several documents based on the same template, constructs a special kind of regular expression over the tokens in the HTML and can (in many cases) detect repeated patterns of this kind. For blogs, the input documents could be, for example, the paginated index pages. You would probably still have to pick out, for each site, precisely which repeated patterns are the ones of interest.

For blogs, I would probably look for a feed link in the header of the blog and use a feed parsing library to parse out the permalinks for each article. Then crawl those permalinks and run boilerpipe on each page (this step is only necessary because lots of blogs don't include the full text in the RSS/Atom feed). Lots of blogs don't include the full text on the main page either, so I'd focus on methods of identifying the permalinks, and go from there.
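
As a rough illustration of the "parse out the permalinks" step, here is a minimal sketch that reads the permalinks from an RSS 2.0 or Atom feed that has already been downloaded into a string. It uses the JDK's built-in DOM parser instead of a dedicated feed parsing library, which is a simplification on my part; the class name FeedPermalinks and the method extract are made up for this example. Discovering the feed link in the page header, fetching each permalink, and running boilerpipe on it are not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for this answer; not part of boilerpipe or any feed library.
class FeedPermalinks {

    /** Returns one permalink per RSS <item> or Atom <entry> found in the feed XML. */
    static List<String> extract(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The feed comes from an arbitrary site, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));

            List<String> permalinks = new ArrayList<>();
            collect(doc.getElementsByTagName("item"), permalinks);  // RSS 2.0 posts
            collect(doc.getElementsByTagName("entry"), permalinks); // Atom posts
            return permalinks;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    private static void collect(NodeList posts, List<String> out) {
        for (int i = 0; i < posts.getLength(); i++) {
            NodeList links = ((Element) posts.item(i)).getElementsByTagName("link");
            for (int j = 0; j < links.getLength(); j++) {
                Element link = (Element) links.item(j);
                // Atom uses <link href="..." rel="alternate"/>, RSS uses <link>url</link>.
                String rel = link.getAttribute("rel");
                String url = link.hasAttribute("href")
                        ? link.getAttribute("href")
                        : link.getTextContent().trim();
                boolean pointsToPost = rel.isEmpty() || rel.equals("alternate");
                if (pointsToPost && !url.isEmpty()) {
                    out.add(url);
                    break; // keep at most one permalink per post
                }
            }
        }
    }
}
```

Calling FeedPermalinks.extract(feedXml) gives a list of URLs, one per post, and each of those pages can then be fetched and passed to boilerpipe separately, so every article ends up as its own text. A real feed library would cope better with malformed or unusual feeds than this sketch does.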
