
I'm trying to parse links with a regex in Java.

But I think it's getting too slow. For example, to extract all links from:

...it's spending 34642 milliseconds (34 seconds!!!)

Here is the regex:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";

The flags for the pattern:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;

And the code may be something like this:

private void processURL(URL url){
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();            
        while(line!=null){
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    } catch (Exception e){
    }
}

Can you give me a Hint?

Extra Data:
  • 1 Mbit
  • Core 2 Duo
  • 1 GB RAM
  • Single-threaded

  • Regex for scraping a website! Bad, bad, bad option! Commented Oct 11, 2010 at 22:55
  • Does it run any faster if you download the entire page first, then run your regex? Are you sure it's the regex taking so long and not the incremental download? Commented Oct 12, 2010 at 0:54
  • Yes Steven, it's the regex stuff. I'm doing some profiling with different parsers. Commented Oct 12, 2010 at 1:59
  • This answer also answers this question. Commented Oct 12, 2010 at 21:17

4 Answers


Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking; in the worst case it could even approach O(N^6), where N is the number of input characters. You can ease this a bit by replacing greedy matching with lazy (reluctant) matching, but it is almost impossible to avoid pathological cases, e.g. when the input data is sufficiently malformed that the regex does not match.

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.
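As an illustration of the parser approach, here is a minimal sketch using the JDK's built-in HTMLEditorKit callback parser (the same class the asker mentions trying in the comment below). The LinkExtractor class name and command-line argument are just scaffolding for the example; TagSoup or HtmlCleaner would work along the same lines.

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        URL url = new URL(args[0]);
        try (Reader reader = new InputStreamReader(url.openStream())) {
            // ParserDelegator walks the HTML once and fires a callback per tag,
            // so there is no regex backtracking and no need to buffer the whole page.
            new ParserDelegator().parse(reader, new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            System.out.println(href);
                        }
                    }
                }
            }, true);
        }
    }
}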


1 Comment

Thanks brother, I'll try javax.swing.text.html.HTMLEditorKit.

All your time, all of it, is being spent here:

    html += line;

Use a StringBuffer. Better still, if you can, run the match on every line and don't accumulate them at all.
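For reference, here is a sketch of the question's processURL method with the accumulation switched to a StringBuilder (keeping the regexp and flags fields from the question); a later comment below reports that this change alone cut the run from about 7 s to about 2 s on their machine.

private void processURL(URL url) {
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        URLConnection connection = url.openConnection();
        BufferedReader bf = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        // Append to a StringBuilder instead of concatenating Strings,
        // which copies the whole buffer on every iteration (O(n^2) overall).
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = bf.readLine()) != null) {
            html.append(line).append('\n');
        }
        bf.close();
        Matcher matcher = pattern.matcher(html); // Matcher accepts any CharSequence
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    } catch (IOException e) {
        e.printStackTrace(); // at least log the failure instead of swallowing it
    }
}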

3 Comments

Good hint; anyway, I need to reconstruct the whole HTML because lines can be broken.
Sorry EJP, it doesn't change much given the huge time the regex takes. It does change something, but it's nothing compared with the overall time. It is a good tip though; I've coded it.
I know this is an old topic, but what user207421 says is true. I tried your code: it still takes 7120 ms on my machine. However, when you replace String html with StringBuilder html, it takes 2200 ms. Now most of the time is spent retrieving the data.

I have written a simple test comparing the performance of 10 million regex operations against String.indexOf(), with the following result:

0.447 seconds
6.174 seconds
13.812080536912752 times regexp longer.

import java.util.regex.Pattern;

public class TestRegExpSpeed {
    public static void main(String[] args) {
        String match = "FeedUserMain_231_Holiday_Feed_MakePresent-1_";
        String unMatch = "FeedUserMain_231_Holiday_Feed_Make2Present-1_";

        long start = System.currentTimeMillis();
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                match.indexOf("MakePresent");
            } else {
                unMatch.indexOf("MakePresent");
            }
        }

        double indexOf = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(indexOf + " seconds");

        start = System.currentTimeMillis();
        Pattern compile = Pattern.compile(".*?MakePresent.*?");
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                compile.matcher(match).matches();
            } else {
                compile.matcher(unMatch).matches();
            }
        }
        double regexp = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(regexp + " seconds");

        System.out.println(regexp / indexOf + " times regexp longer. ");
    }
}

1 Comment

This is not an answer. Yes, indexOf() is faster than regex for that particular task, but why would you use a regex for that? How would the OP use indexOf() to solve his problem?

Try Jaunt instead. Please don't use regex for this.

Regex use vs. Regex abuse

Regular expressions are not Parsers. Although you can do some amazing things with regular expressions, they are weak at balanced tag matching. Some regex variants have balanced matching, but it is clearly a hack -- and a nasty one. You can often make it kinda-sorta work, as I have in the sanitize routine. But no matter how clever your regex, don't delude yourself: it is in no way, shape or form a substitute for a real live parser.

Source

2 Comments

Regex matching is a very expensive operation, and when you are scraping a website you need to parse a lot of text. acorns.com.au/blog/?p=136
The complexity of evaluating the regex is terrible compared to the linear complexity of parsing the HTML page.
