0

solution: this works:

String p="<pre>[\\\\w\\\\W]*</pre>";

I want to match and capture the content enclosed by a <pre></pre> tag. I tried the following, but it is not working. What's wrong?

String p="<pre>.*</pre>";

        Matcher m=Pattern.compile(p,Pattern.MULTILINE|Pattern.CASE_INSENSITIVE).matcher(input);
        if(m.find()){
            String g=m.group(0);
            System.out.println("g is "+g);
        }
3
  • 2
    Seriously, you shouldn't be parsing HTML with regular expressions. Use a library such as TagSoup instead. Commented May 8, 2010 at 0:20
  • <sigh> here we go again ... did you try a search? how about this guidance - stackoverflow.com/questions/2400623/… Commented May 8, 2010 at 0:25
  • 1
    [\\\\w\\\\W] will match a backslash, w or W. You probably meant [\\w\\W], but you don't need to do that. Just use the DOTALL flag, as I said in my answer. That other trick is used a lot in JavaScript because JS has no equivalent for the DOTALL flag. Commented May 8, 2010 at 1:10
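The escaping difference described in the comment above can be checked directly. This is a minimal sketch; the class name `EscapeDemo` and the helper `matchesOne` are hypothetical names chosen for illustration, not part of the original post.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Java source "[\\\\w\\\\W]" compiles to the regex [\\w\\W]:
    // a character class holding a literal backslash, 'w' and 'W'.
    static final Pattern OVER_ESCAPED = Pattern.compile("[\\\\w\\\\W]");

    // Java source "[\\w\\W]" compiles to the regex [\w\W]:
    // a word character or a non-word character, which covers every character.
    static final Pattern ANY_CHAR = Pattern.compile("[\\w\\W]");

    // True when the whole string s is matched by pattern p.
    static boolean matchesOne(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesOne(OVER_ESCAPED, "a"));  // false
        System.out.println(matchesOne(OVER_ESCAPED, "w"));  // true
        System.out.println(matchesOne(ANY_CHAR, "a"));      // true
        System.out.println(matchesOne(ANY_CHAR, "\n"));     // true
    }
}
```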

3 Answers

4

Regex is in fact not the right tool for this. Use a parser. Jsoup is a nice one.

Document document = Jsoup.parse(html);
for (Element element : document.getElementsByTag("pre")) {
    System.out.println(element.text());
}

The parse() method also has overloads that take a URL or a File, by the way.


The reason I recommend Jsoup is that it is the least verbose of all the HTML parsers I have tried. It not only provides JavaScript-like methods that return elements implementing Iterable, but also supports jQuery-like selectors, which was a big plus for me.


3

You want the DOTALL flag, not MULTILINE. MULTILINE changes the behavior of the ^ and $ anchors, while DOTALL is the one that lets . match line separators. You probably want to use a reluctant quantifier, too:

String p = "<pre>.*?</pre>";
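Putting the pieces of this answer together, a minimal sketch might look like the following. The class name `PreExtractor`, the method `firstPre`, and the added capturing group are assumptions made for illustration. Like the original pattern, it only recognizes a bare `<pre>` tag with no attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreExtractor {
    // DOTALL lets '.' match line separators; CASE_INSENSITIVE also accepts <PRE>.
    // The parentheses (an addition for this sketch) capture only the inner content.
    private static final Pattern PRE = Pattern.compile(
            "<pre>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the content of the first pre element, or null if none is found. */
    public static String firstPre(String input) {
        Matcher m = PRE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<p>intro</p>\n<pre>line one\nline two</pre>\n<p>outro</p>";
        System.out.println("g is " + firstPre(input));
    }
}
```

Using group(1) rather than group(0) returns the text between the tags without the tags themselves, which is what the question asks to capture.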

1 Comment

If there's more than one <pre> element, a greedy .* will match from the first opening <pre> to the last closing </pre>. The reluctant (or non-greedy) .*? will stop at the first closing tag.
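The difference between the two quantifiers can be seen on an input with two elements. This is a small sketch; the class name `GreedyVsReluctant`, the helper `firstMatch`, and the sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns the first full match of regex in input (DOTALL enabled), or null.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<pre>a</pre><p>x</p><pre>b</pre>";

        // Greedy: runs from the first <pre> to the last </pre>.
        System.out.println(firstMatch("<pre>.*</pre>", input));

        // Reluctant: stops at the first </pre>.
        System.out.println(firstMatch("<pre>.*?</pre>", input));
    }
}
```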
1
String stringToSearch = "H1 FOUR H1 SCORE AND SEVEN YEARS AGO OUR FATHER...";

// the case-insensitive pattern we want to search for
Pattern p = Pattern.compile("H1", Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(stringToSearch);

// see if we found a match
int count = 0;
while (m.find())
    count++;

System.out.println("H1 : "+count);   
