
I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text. For example: hello world ----> hello world

Is there a way to extract the text using the Java standard library? Maybe something more efficient than replacing every open/close tag matched by a regex with an empty string? Thanks.


4 Answers

2

Don't use regex to parse HTML but a dedicated parser like HtmlCleaner.

A regex will usually work on the first test, then grow more and more complex until it becomes impossible to adapt.
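
To illustrate how this tends to go, here is a minimal sketch of the tag-stripping regex mentioned in the question. The class name RegexStripDemo and the sample inputs are made up for illustration; the pattern is just one plausible naive choice, not something taken from the question.

```java
class RegexStripDemo {

    // Naive approach: delete everything from '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup: works as hoped.
        System.out.println(stripTags("<b>hello world</b>"));          // hello world

        // A '>' inside an attribute value ends the match too early.
        System.out.println(stripTags("<a title=\"a > b\">link</a>")); //  b">link

        // Character entities are left undecoded.
        System.out.println(stripTags("<p>1 &lt; 2</p>"));             // 1 &lt; 2
    }
}
```

Each of these cases can be patched with a more elaborate pattern, but comments, scripts and malformed markup keep adding new ones, which is where a real parser pays off.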


2

I will also say it - don't use regex with HTML. ;-)

You can give JTidy a shot.


2

Don't use regular expressions to parse HTML; use, for instance, jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.

Example: fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

There is also an HTML parser in the JDK: javax.swing.text.html.parser.Parser, which could be applied like this:

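// harvester is an HTMLEditorKit.ParserCallback subclass (see the callback below)
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
// third argument: ignore the charset declared in the document
parserDelegator.parse(in, harvester, true);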

Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you override the appropriate callback method:

@Override
public void handleStartTag(HTML.Tag tag,
        MutableAttributeSet mutableAttributeSet, int pos) {

    // called for every start tag; only react to <a> and <area>
    if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {

        // read the href attribute of the tag
        String address = (String) mutableAttributeSet
                .getAttribute(HTML.Attribute.HREF);

        /* ... */
    }
}
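
For the case in the question (plain text out of an HTML String), the same JDK parser can be fed a StringReader and a callback that overrides handleText. This is only a minimal sketch: the names HtmlText and extractText are hypothetical, and it assumes that simply concatenating the text chunks is good enough for your input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlText {

    // Returns the text found between the tags of an HTML fragment.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // Callback that collects every text chunk the parser reports.
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };

        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory String.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<b>hello world</b>"));
    }
}
```

Note that text from adjacent elements is appended without any separator, so you may want to add a space or newline in handleText, depending on how the result is used.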


1

You can use HTMLParser, which is open source.
