Question about parsing HTML using Regex and Java

Question

I have a question about finding HTML tags using Java and regex.

I am using the code below to find all the tags in some HTML; documentURL holds the HTML content.

The find() method returns true, meaning that it can find something in the HTML, but the matches() method always returns false, and I am completely and utterly puzzled by this.

I referred to the Java documentation too but could not find my answer.

What is the correct way of using Matcher?

    Pattern keyLineContents = Pattern.compile("(<.*?>)");

    Matcher keyLineMatcher = keyLineContents.matcher(documentURL);

    boolean result = keyLineMatcher.find();

    boolean matchFound = keyLineMatcher.matches();

Doing something like this throws an exception:

     String abc = keyLineMatcher.group(0);

Thanks.

Not the answer you wanted, but avoid parsing HTML with regex. The correct way is to use an HTML parser. java-source.net/open-source/html-parsers
– Yacoby, Commented Mar 6, 2010 at 23:21

cletus · Accepted Answer · 2010-03-06 23:25:15Z

7

The correct way to loop through matches is:

    Pattern p = Pattern.compile("<.*?>");
    Matcher m = p.matcher(htmlString);
    while (m.find()) {
      System.out.println(m.group());
    }
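
To see why matches() returned false in the question while find() returned true: matches() succeeds only when the pattern covers the whole input, whereas find() looks for the next matching substring. A failed matches() also leaves the matcher without a current match, so a following group(0) throws IllegalStateException. Here is a minimal sketch of that behaviour; the class name, method names and sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Collects every substring that find() locates, in order.
    static List<String> findAll(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    // True only when the pattern covers the input from first to last character.
    static boolean matchesWhole(String html) {
        return TAG.matcher(html).matches();
    }

    public static void main(String[] args) {
        String html = "Hello <b>world</b>!";
        System.out.println(findAll(html));       // [<b>, </b>]
        System.out.println(matchesWhole(html));  // false
        System.out.println(matchesWhole("<b>")); // true

        Matcher m = TAG.matcher(html);
        m.find();     // succeeds; group() would return "<b>" at this point
        m.matches();  // fails and discards the match found above
        try {
            m.group(0);
        } catch (IllegalStateException e) {
            System.out.println("group(0) failed: no current match");
        }
    }
}
```

So in the question's code, the group(0) call would have worked directly after the successful find(); it is the failed matches() in between that clears the match.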

That being said, regular expressions are an extremely poor method of parsing HTML. The reason comes down to this: regular expressions work well for parsing regular languages, but HTML is a context-free language. Regular expressions fall down on things like nested tags, a > inside an attribute value, and so on.
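
As a small illustration of the attribute-value problem, the sketch below runs the same `<.*?>` pattern over an anchor tag whose title attribute contains a `>`. The class name, method name and sample markup are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeBreakage {
    // Same pattern as in the loop above, applied naively to the input.
    static List<String> regexTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = Pattern.compile("<.*?>").matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<a title=\"1 > 0\" href=\"x.html\">link</a>";
        for (String tag : regexTags(html)) {
            System.out.println(tag);
        }
        // prints:
        // <a title="1 >
        // </a>
    }
}
```

The opening tag is cut off at the first `>`, so the href attribute never shows up in the match.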

Use a dedicated HTML parser instead, such as HTML Parser.
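
The API of the HTML Parser library is not shown in this thread, so the sketch below swaps in the parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) purely to illustrate the callback style that a real parser offers. The names TagLister and listTags are made up. This parser may also report html, head and body tags that it implies on its own, and a dedicated library is probably a better fit for real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagLister {
    // Returns the names of the tags the parser reports, in document order.
    static List<String> listTags(String html) {
        List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(t.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(t.toString()); // elements without content, e.g. <br> or <meta>
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(listTags("<p>Hello <a href=\"x.html\">link</a></p>"));
    }
}
```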

answered Mar 6, 2010 at 23:25

cletus


1 Comment

Stephen C Over a year ago

"I will use html parser later". That's what they all say ... :-)

Finbarr · Accepted Answer · 2010-03-06 23:59:26Z

2

Why don't you try looking at the source code of some open-source HTML parsers, such as HtmlCleaner or TagSoup?

The general strategy seems to be to attempt to parse and clean the HTML and return an XML tree.

Personally, I would read through the HTML, pushing each opening tag onto a LIFO queue (a stack) and, when a closing tag is encountered, removing the matching opening tag from the top of the stack, shifting entries as needed to allow for tag mismatches.
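
A rough sketch of that stack idea follows. It is only one possible reading of the description. The class TagStack and the method unclosedTags are invented names, and the tag tokenizer is a deliberately simplistic regex that assumes no comments, scripts or `>` characters inside attribute values. Opening tags are pushed; a closing tag pops entries until its partner is removed, and anything popped along the way is reported as never explicitly closed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStack {
    // Group 1 is "/" for a closing tag, group 2 the tag name,
    // group 3 is "/" for a self-closing tag such as <br/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // Returns the names of opening tags that never got their own closing tag.
    static List<String> unclosedTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> unclosed = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing) {
                continue;                 // nothing to match later
            }
            if (!closing) {
                open.push(name);          // opening tag goes on top of the stack
                continue;
            }
            if (!open.contains(name)) {
                continue;                 // stray closing tag: ignore it
            }
            while (!open.peek().equals(name)) {
                unclosed.add(open.pop()); // mismatched tags above the partner
            }
            open.pop();                   // the matching opening tag
        }
        unclosed.addAll(open);            // still open at end of input
        return unclosed;
    }

    public static void main(String[] args) {
        System.out.println(unclosedTags("<div><p>text <b>bold</p></div>")); // [b]
    }
}
```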

answered Mar 6, 2010 at 23:59

Finbarr


1 Comment

Alan Moore Over a year ago

Is this answer in response to @Raha's question about writing one's own HTML parser?

Alan Moore · Accepted Answer · 2010-03-07 01:41:05Z

1

I want to get the keywords content from an HTML tag. I wrote:

    Pattern keyLineContents = Pattern.compile("<(.[^<]*)(keywords)(.[^<]*)>");
    Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
    boolean result = keyLineMatcher.find();
    while(result)
    {
      String metaTagContent = keyLineMatcher.group(1) + " " + keyLineMatcher.group(3);
      Pattern kcontent = Pattern.compile("(.*?content=\")(.[^<]*?)(\".[^<]*?)");
      Matcher keyLineMatcher2 = kcontent.matcher(metaTagContent);
      boolean result2 = keyLineMatcher.find();
      while (result2)
      {
        String metaTagContent2 = keyLineMatcher.group(1);
        result2 = keyLineMatcher.find();
      }
    }

But I don't understand why result2 is false. The first result is fine and gives all the content of the keywords tag.

thanks

edited Mar 7, 2010 at 1:41

Alan Moore


answered Mar 7, 2010 at 0:16

Elham


1 Comment

Alan Moore Over a year ago

Try these regexes instead: "<([^<]*)(keywords)([^<]*)>" and ".*?content=\"([^<]*?)\""
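
Besides the regexes, the posted loop has two issues that likely explain the false result2: the inner loop calls keyLineMatcher.find() and keyLineMatcher.group(1) where keyLineMatcher2 was presumably intended, and the outer loop never updates result. A sketch with both points addressed, using the two regexes suggested in the comment above, might look like this. The names KeywordExtractor and keywordContents are invented, and the approach still shares all the usual limits of regex-based HTML handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordExtractor {
    // Finds a tag that mentions "keywords"; groups 1 and 3 hold the rest of the tag.
    private static final Pattern META = Pattern.compile("<([^<]*)(keywords)([^<]*)>");
    // Captures the value of a content="..." attribute in group 1.
    private static final Pattern CONTENT = Pattern.compile(".*?content=\"([^<]*?)\"");

    static List<String> keywordContents(String html) {
        List<String> contents = new ArrayList<>();
        Matcher tagMatcher = META.matcher(html);
        while (tagMatcher.find()) {                      // advances on every iteration
            String attributes = tagMatcher.group(1) + " " + tagMatcher.group(3);
            Matcher contentMatcher = CONTENT.matcher(attributes);
            while (contentMatcher.find()) {              // inner matcher, not the outer one
                contents.add(contentMatcher.group(1));
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        String html = "<head><meta name=\"keywords\" content=\"java, regex\"><title>Demo</title></head>";
        System.out.println(keywordContents(html)); // [java, regex]
    }
}
```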
