
I want to extract all text inside the HTML body tags with the following Java code:

Pattern.compile(".*<\\s*body\\s*>(.*?)<\\s*/\\s*body\\s*>.*", Pattern.DOTALL);

..

matcher.find() ? matcher.group(1) : originalText

That works fine for HTML, but for larger texts that contain no HTML (and therefore no body element), e.g. long stack traces, the call to matcher.find() takes a long time.

Does anyone know what the cause is, and how to make this regular expression faster?

Thanks in advance!

9
  • 2
    Don't parse HTML with regex! Commented Feb 10, 2015 at 9:53
  • 1
    This actually matches the whole document while capturing very little. Remove the .* at the end of your regex. Commented Feb 10, 2015 at 9:54
  • 1
    Use JSoup. Don't use regex to parse HTML. Commented Feb 10, 2015 at 9:54
  • 2
    You should really look at regex quantifiers. Don't use greedy quantifiers everywhere. Commented Feb 10, 2015 at 9:56
  • 1
    (Java's regex package is susceptible to ReDoS, see en.wikipedia.org/wiki/ReDoS.) The .* parts cause the problem in this case, but the .*? is probably even worse, as it has to backtrack more often once it finds <body>. Commented Feb 10, 2015 at 9:59

1 Answer

2

The regular expression is now:

<\\s*?body\\s*?>(.*?)<\\s*?/\\s*?body\\s*?>

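The .* at the beginning and at the end of the expression was removed, and now it works correctly and quickly. In addition, all quantifiers are now non-greedy. The leading .* was the main cause: on input without a body tag it consumed the whole text and then backtracked character by character, and matcher.find() repeated that from every start position, which is quadratic in the length of the text.

Thanks for your helpful comments!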
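For reference, here is a minimal sketch of how the trimmed expression might be used. The class name BodyExtractor and the method name extractBody are made up for illustration and are not from the original code. The whitespace quantifiers are left greedy here, which should make no practical difference because \s* can only consume a short run of whitespace. As in the original, a body tag with attributes (for example a class attribute) would not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodyExtractor {
    // Compiled once and reused. There is no leading or trailing ".*":
    // find() already scans the input for the first position that matches.
    private static final Pattern BODY = Pattern.compile(
            "<\\s*body\\s*>(.*?)<\\s*/\\s*body\\s*>", Pattern.DOTALL);

    // Hypothetical helper: returns the text between the body tags,
    // or the input unchanged when no body element is found.
    static String extractBody(String originalText) {
        Matcher matcher = BODY.matcher(originalText);
        return matcher.find() ? matcher.group(1) : originalText;
    }

    public static void main(String[] args) {
        // Input with a body element: only the inner text is returned.
        System.out.println(extractBody("<html><body>Hello</body></html>"));
        // Input without any HTML: returned as-is.
        System.out.println(extractBody("java.lang.IllegalStateException\n\tat Foo.bar(Foo.java:1)"));
    }
}
```

On text without a body tag, this pattern fails quickly at every position that does not start with "<", so a long stack trace is scanned roughly once instead of being backtracked over repeatedly.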
