Regular Expression in Java Performance Issue

Question

I want to extract all text inside the HTML body tags with the following Java code:

Pattern.compile(".*<\\s*body\\s*>(.*?)<\\s*/\\s*body\\s*>.*", Pattern.DOTALL);

..

matcher.find() ? matcher.group(1) : originalText

That works fine for HTML, but for larger texts that contain no HTML (and therefore no body element), e.g. long stack traces, the call to matcher.find() takes a very long time.

Does anyone know what the cause is, and how to make this regular expression perform better?

Thanks in advance!

This actually matches the whole document while capturing very little. Remove the .* at the end of your regex.
– vks, Commented Feb 10, 2015 at 9:54
You should really look at regex quantifiers. Don't use greedy quantifiers everywhere.
– TheLostMind, Commented Feb 10, 2015 at 9:56
(Java's regex package is susceptible to ReDoS, see en.wikipedia.org/wiki/ReDoS.) The .* parts cause the problem in this case, but the .*? is probably even worse, as it has to backtrack more often once it finds <body>.
– Gábor Bakos, Commented Feb 10, 2015 at 9:59

vhunsicker · Accepted Answer · 2015-02-10 11:13:14Z

2

The regular expression is now:

<\\s*?body\\s*?>(.*?)<\\s*?/\\s*?body\\s*?>

The .* at the beginning and at the end of the expression was removed, and it now works correctly and quickly. In addition, all quantifiers are now non-greedy.

Thanks for your helpful comments!
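For illustration, here is a minimal sketch of how the revised pattern could be wired into the extraction from the question. The class name BodyExtractor and the method name extractBody are hypothetical and not part of the original code. The likely reason for the speed-up is that, without the leading .*, a match attempt at a given start position fails as soon as the current character is not '<'. With the leading .*, each attempt first consumes the rest of the input and then backtracks through it, and find() repeats that from every start position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyExtractor {
    // Pattern from the answer, compiled once and reused.
    // DOTALL lets the captured group span line breaks, as in the question.
    private static final Pattern BODY = Pattern.compile(
            "<\\s*?body\\s*?>(.*?)<\\s*?/\\s*?body\\s*?>", Pattern.DOTALL);

    // Hypothetical helper: returns the text between <body> and </body>,
    // or the input unchanged when no body element is found.
    static String extractBody(String originalText) {
        Matcher matcher = BODY.matcher(originalText);
        // find() searches for the pattern anywhere in the input, so no
        // surrounding .* is needed to skip the text before and after the body.
        return matcher.find() ? matcher.group(1) : originalText;
    }

    public static void main(String[] args) {
        // Input with a body element: prints the captured group.
        System.out.println(extractBody("<html><body>Hello</body></html>"));
        // Input without a body element: prints the text unchanged.
        System.out.println(extractBody("java.lang.IllegalStateException: boom"));
    }
}
```

Compiling the pattern once in a static field also avoids recompiling it on every call, which is a separate and smaller saving than removing the leading and trailing .*.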

edited Feb 10, 2015 at 11:13

answered Feb 10, 2015 at 10:26
