-1

Possible Duplicate:
Text extraction with java html parsers

I'm new to Java and am trying to program an algorithm for web page classification. I want to know how to extract text from HTML web pages in Java. It would be a great help if I could get a basic idea of what to do.

Thanks, Archana


3 Answers

0

Once you have obtained the raw HTML string, you could turn to an existing HTML parsing tool, such as jsoup.

Look here for a comparison: What are the pros and cons of the leading Java HTML parsers?

There is also a quick example of what you can easily extract from an HTML page using jsoup and CSS selectors: http://jsoup.org/cookbook/extracting-data/example-list-links
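To give a rough idea of what this looks like, here is a minimal sketch. It assumes the jsoup library is on the classpath; the class name `JsoupTextExample`, the method name `extractText`, and the sample HTML are made up for illustration. `Jsoup.parse` builds a document from the HTML string, and `body().text()` returns the visible text of the body with the tags removed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup itself.
class JsoupTextExample {

    // Parses the raw HTML string and returns the text of the <body>, without tags.
    static String extractText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><h1>News</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

If you only need part of the page, `doc.select(...)` with a CSS selector (as in the cookbook link above) lets you narrow things down before calling `text()`.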


1 Comment

Hey guys, thanks a lot for the suggestions. I'm finally using jsoup and it works!
0

I use Jericho to convert an HTML document to text. The code to get the text is pretty simple:

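    import net.htmlparser.jericho.Renderer;
    import net.htmlparser.jericho.Source;
    // html is the raw HTML document as a String
    Source source = new Source(html);
    Renderer renderer = source.getRenderer();
    String text = renderer.toString();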

There are some options you can set on the renderer to adjust the text conversion, like:

    renderer.setIncludeHyperlinkURLs(false);


-1

@Codemwnci's answer helps you download the HTML page.

If you're looking for a way to separate HTML markup tags from content, you should use an HTML parser.
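If you don't want to add a library yet, here is a rough sketch of the same idea using the HTML parser that ships with the JDK (in the `javax.swing.text.html` packages). The class name `HtmlTextExtractor` and the method name `extractText` are made up for this example, and joining the text runs with a single space is just one simple choice. A dedicated parser such as jsoup or Jericho will probably cope better with messy real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class built on the JDK's bundled HTML parser.
class HtmlTextExtractor {

    // Collects every text run the parser reports and joins the runs with spaces.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for text found between tags; the tags themselves are skipped.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<html><body><h1>News</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The callback approach means you never touch the markup yourself: the parser walks the document and hands you only the text, which you can then feed into your classifier.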

2 Comments

-1 for suggesting regular expressions to parse HTML.
@Richard, I agree that regular expressions are probably not the best choice, but I also suggested using a parser. I actually edited my response and reordered the suggestions after your -1. I suggested regular expressions for cases such as getting the text from only one specific HTML tag.
