
I'm working on a project in Java 8 where I'd like to read an HTML file from either a directory or a link, remove all style and script tags from it, and return what is left. This is performed iteratively on a very large number of files.

Right now these are the two different regex patterns I'm using to remove the specified tags.

//remove style tags and style tag content
update = update.replaceAll("<style\\b[^<]*(?:(?!</style>)<[^<]*)*</style>", "");

//remove script tags and script tag content
update = update.replaceAll("<script[\\s\\S]*?>[\\s\\S]*?</script>", "");

This works for a while, but occasionally I get a java.lang.StackOverflowError.

I believe this happens when the file is too large. From what I've read, it can happen when a pattern uses "|", because alternation is matched recursively and can consume a lot of stack depending on how many levels are traversed.

I've managed to run these patterns iteratively on different test files thousands of times.

My question is: can anyone see whether these patterns would use recursion, or anything else that suggests the pattern itself is what's causing the overflow?

If not, perhaps there's a way for me to reduce the string to a size that wouldn't cause this overflow.

Using print statements it seems that the overflow may be happening when trying to match the pattern:

"<script[\\s\\S]*?>[\\s\\S]*?</script>"

Additionally, I was told I could use this instead:

"<script[\\s\\S]+?>[\\s\\S]+?</script>"

The reasoning was that this doesn't look ahead as far. This pattern works in RegExr but did not give the same output once implemented in the Java application.

Here is the stack trace I receive:

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Curly.match0(Pattern.java:4252)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3800)
at java.util.regex.Pattern$Neg.match(Pattern.java:5099)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
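
One hedged reading of this trace: the Loop, GroupHead, GroupTail and Neg frames look like a repeated group containing a negative lookahead, which is the shape of "(?:(?!</style>)<[^<]*)*" in the style pattern rather than anything in the script pattern. In java.util.regex a repeated group of that kind is matched recursively, so every extra "<" it consumes deepens the call stack. Below is a minimal sketch of a variant that repeats only a single-character node, assuming a lazy ".*?" in DOTALL mode is acceptable for these files. The class and method names are hypothetical, and this is still a regex over HTML, not a parser.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // (?i) ignores case, (?s) lets '.' match line breaks.
    // The lazy ".*?" repeats one character at a time, so no group is repeated.
    private static final Pattern STYLE =
            Pattern.compile("(?is)<style\\b[^>]*>.*?</style\\s*>");
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    /** Removes style and script elements, including their content. */
    static String strip(String html) {
        String withoutStyle = STYLE.matcher(html).replaceAll("");
        return SCRIPT.matcher(withoutStyle).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><head><style type=\"text/css\">p { color: red; }</style>"
                + "<script>if (a < b) { run(); }</script></head>"
                + "<body><p>Hi</p></body></html>";
        // prints <html><head></head><body><p>Hi</p></body></html>
        System.out.println(strip(html));
    }
}
```

Compiling the patterns once as constants also avoids recompiling them on every call, which String.replaceAll does, and that may matter when iterating over a very large number of files.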

I'm open to any and all advice. Thank you in advance.

  • What version of Java are you using? There were many regex updates in Java 9, so I'd update if you're using Java 8 (or below) and let us know if the problem persists. Commented Nov 15, 2018 at 15:22
  • Apologies for the lack of info. I'm using Java 8 for this application. Commented Nov 15, 2018 at 15:23
  • This is a common problem with regex. Using the library below can help you overcome it: github.com/google/re2j . Quoting the library documentation: "In the worst case, the java.util.regex matcher may run forever, or exceed the available stack space and fail; this will never happen with RE2/J." Commented Nov 15, 2018 at 15:23
  • Parsing HTML with a regular expression is not advisable. See stackoverflow.com/questions/701166/…. Commented Nov 15, 2018 at 15:28
  • @JonathanHinds What VGR and most others will tell you is that you are better off treating your document as XML (or really HTML) and using a parser for that in your language to find the element and remove it rather than treating it as a complex string and trying to regex your way through the same process. Commented Nov 15, 2018 at 16:46
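
As a dependency-free illustration of moving away from one large regex, the sketch below removes an element by scanning for its opening and closing tags with plain string searches, so stack depth does not grow with file size. It is neither RE2/J nor a real HTML parser of the kind the comments recommend, and it makes simplifying assumptions: elements are not nested, and the first closing tag after an opening tag ends the element. All names here are hypothetical.

```java
final class ElementRemover {
    /** Case-insensitive indexOf that never changes string length or offsets. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i <= text.length() - needle.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Removes every <tag ...>...</tag> block; unclosed elements are left untouched. */
    static String removeElement(String html, String tag) {
        String open = "<" + tag;
        String close = "</" + tag;
        StringBuilder out = new StringBuilder(html.length());
        int copiedUpTo = 0; // start of the text not yet copied to out
        int search = 0;     // where to look for the next opening tag
        while (true) {
            int start = indexOfIgnoreCase(html, open, search);
            if (start < 0) break;
            int afterName = start + open.length();
            if (afterName < html.length()
                    && Character.isLetterOrDigit(html.charAt(afterName))) {
                search = afterName; // a longer tag name such as <styles>, skip it
                continue;
            }
            int closeStart = indexOfIgnoreCase(html, close, afterName);
            if (closeStart < 0) break; // no closing tag anywhere after this point
            int closeEnd = html.indexOf('>', closeStart);
            if (closeEnd < 0) break;   // closing tag is truncated
            out.append(html, copiedUpTo, start);
            copiedUpTo = closeEnd + 1;
            search = copiedUpTo;
        }
        out.append(html, copiedUpTo, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a</p><SCRIPT src=x>var s = '<b>';</SCRIPT><p>b</p>";
        // prints <p>a</p><p>b</p>
        System.out.println(removeElement(removeElement(html, "style"), "script"));
    }
}
```

Each loop iteration handles one element and returns to the same stack depth, which is the property the recursive group repetition in the original style pattern lacks.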

1 Answer


I ended up using a combination of the suggestions from VGR and MatthewGreen. RE2/J solved my regex problem and improved the performance of the matching. Ultimately, though, I decided to depend less on regex: I now use JSoup to parse the document and remove the unwanted elements, and regex only to extract what I want from what remains.
