15

I am new to Java. I am getting a java.lang.StackOverflowError from the regex replacements on strHindiText. What should I do about it?

try {
     // This regex converts the pattern "{\fldrslt {\fcs1 \ab\af24 \fcs0 ऩ}{"
     // into "{\fldrslt {\fcs1 \ab\af24 \fcs0 ऩ}}}{"
     // strHindiText = strHindiText.replaceAll("\\{(\\\\fldrslt[ ])\\{((\\\\\\S+[ ])+)((\\s*&#\\d+;\\s*(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*)+)\\}\\{","{$1{$2$4}}}{");

     // This regex converts the pattern "{\fcs0 \af0 &#2345;{" or "{\fcs0 \af0 *\tab &#2345;{"
     // into "{\fcs0 \af0 &#2345; }{"
     strHindiText = strHindiText.replaceAll("\\{\\s*((\\\\\\S+[ ](\\*)?)+\\s*)(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*[ ]*(((&#\\d+;)[ ]*(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*[ ]*)+)\\{", "{$1 $4$5 }{");

     // This regex converts the pattern "{&#2345; \fcs0 \af0 {"
     // into "{&#2345; \fcs0 \af0 }{"
     strHindiText = strHindiText.replaceAll("\\{\\s*(((&#\\d+;)[ ]*(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*[ ]*)+)[ ]*((\\\\\\S+[ ])+)\\{", "{$1 $5 }{");

} catch (StackOverflowError er) {
     System.out.println("Third try Block StackOverflowError in regex pattern to reform the rtf tags................");
     er.printStackTrace();
     // throw er;
}



Whenever strHindiText contains a large amount of data, the code throws a java.lang.StackOverflowError:

java.lang.StackOverflowError
2013-08-08 15:35:07,743 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match0(Pattern.java:3754)
2013-08-08 15:35:07,743 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
2013-08-08 15:35:07,744 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
2013-08-08 15:35:07,744 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
2013-08-08 15:35:07,745 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
2013-08-08 15:35:07,745 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)



My strHindiText data is:

 `{\rtlch\fcs1 \af1\afs18 \ltrch\fcs0 \f1\fs18\cf21\insrsid13505584 &#2349;&#2379;&#2346;&#2366;&#2354;&#32; &#2404; \par }\pard\plain \ltrpar\s16\ql \li0\ri0\sb100\sa100\sbauto1\saauto1\sl240\slmult0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0\pararsid13505584 \cbpat20 \rtlch\fcs1 \af0\afs24\alang1025 \ltrch\fcs0 \fs24\lang1033\langfe1033\cgrid\langnp1033\langfenp1033 {\rtlch\fcs1 \ab\af1\afs18 \ltrch\fcs0 \cs21\b\f1\fs18\cf21\insrsid13505584 &#2309;&#2344;&#2381;&#2357;&#2375;&#2359;&#2339;&#32;&#2325;&#2352;&#2375;&#2306;&#32; :}{\rtlch\fcs1 \af1\afs18 \ltrch\fcs0 \f1\fs18\cf21\insrsid13505584  \par &#2349;&#2379;&#2346;&#2366;&#2354;&#32;&#44;&#32;&#2350;&#2343;&#2381;&#2351;&#32;&#2346;&#2381;&#2352;&#2342;&#2375;&#2358;&#32;&#2325;&#2368;&#32;&#2352;&#2366;&#2332;&#2343;&#2366;&#2344;&#2368;&#32;&#2346;&#2381;&#2352;&#2366;&#2325;&#2371;&#2340;&#2367;&#2325;&#32;&#2360;&#2369;&#2306;&#2342`
13
  • 8
    Your alternative paths | are probably causing recursive calls, resulting in the stackoverflow. Regex stuff is complicated in general, and your regex is big. I'm not surprised. Commented Aug 8, 2013 at 10:02
  • 1
    I would suggest instead of alternatives (e.g a|b|c) to use the alternative notation: [abc], this should make the regex clearer, and you just need to escape the closing bracket and no other character. Also, it looks like you want to do something that regexes aren't good for - parsing - for something that isn't text but has a higher ordering. Commented Aug 8, 2013 at 10:33
  • 7
    You really shouldn't use RegEx for such enormous parsings.. it's not very performant, since the regex expression compiles every time you try to match a string. Commented Aug 8, 2013 at 11:13
  • 6
    Everything about your code is asking for problems. Try breaking the problem into multiple small problems rather than trying to do a bazillion things all at once with a giant regex. Based on the regexes you're using, I'd be surprised if you didn't experience memory problems. Commented Aug 8, 2013 at 19:32
  • 1
    I would personally recommend writing a parser for your RTF rather than attempting to cut it up with regex. Regex is meant for simple things, and I don't imagine RTF in Hindi is simple at all. Commented Aug 8, 2013 at 20:34
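On the recompilation point raised in the comments: String.replaceAll compiles its pattern on every call, so a common remedy is to compile the pattern once and reuse it. A minimal sketch, assuming a simplified stand-in pattern rather than the regexes from the question; the class, field, and method names are hypothetical:

```java
import java.util.regex.Pattern;

public class PrecompiledPattern {
    // Compiled once when the class is loaded and reused for every call.
    // Stand-in pattern (an assumption): a run of numeric character
    // references such as "&#2345;".
    private static final Pattern ENTITY_RUN = Pattern.compile("(?:&#\\d+;)+");

    // Wraps each run of character references in angle brackets.
    static String markEntityRuns(String text) {
        return ENTITY_RUN.matcher(text).replaceAll("<$0>");
    }

    public static void main(String[] args) {
        System.out.println(markEntityRuns("\\fcs0 &#2349;&#2379; \\par"));
    }
}
```

This only saves the compilation cost. It does not change how deeply the matcher recurses, so on its own it is unlikely to remove the StackOverflowError.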

3 Answers

3

Option 1 - Treat the symptoms

Look for recursive calls in your regex.

If you are not sure where the problem lies, try a regex tester.

Option 2 - Treat the cause (much better)

Don't use a regex if there are better tools for your task.

In your case, you could search for an RTF parsing library, such as the one that jahroy pointed out in the comments, or write your own parser.
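If you go the write-your-own route, one possible starting point is a small tokenizer that walks the text once in a loop, so the stack depth does not grow with the size of the input. This is a minimal sketch, assuming a heavily simplified RTF subset (braces, control words made of ASCII letters and digits, and plain text; no escaped braces, control symbols, or negative parameters). The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class RtfTokenizerSketch {
    private static boolean isAsciiAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Splits the input into "{", "}", control words such as "\fcs0", and text runs.
    static List<String> tokenize(String rtf) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '{' || c == '}') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '\\') {
                int start = i++;
                // Simplified control word: backslash plus ASCII letters/digits.
                while (i < rtf.length() && isAsciiAlnum(rtf.charAt(i))) {
                    i++;
                }
                tokens.add(rtf.substring(start, i));
                // One space after a control word is a delimiter, not text.
                if (i < rtf.length() && rtf.charAt(i) == ' ') {
                    i++;
                }
            } else {
                int start = i;
                // Plain text runs until the next brace or backslash.
                while (i < rtf.length() && "{}\\".indexOf(rtf.charAt(i)) < 0) {
                    i++;
                }
                tokens.add(rtf.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("{\\fcs0 \\af0 &#2345;}{\\par }"));
    }
}
```

With the tokens in a list, a rule such as "close the group before the next opening brace" becomes a plain loop over tokens instead of a single large regex over the whole document.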

1

This is not a full answer but just for your information.

In your regex:

(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)* can be written as [-,/()\";.'<>:?]*

Since this pattern occurs twice (in your first regex), this immediately shortens your regex by 40 characters and makes those sections much more readable.
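To check the rewrite on a few inputs, the two forms can be compared side by side. A minimal sketch; the class name, field names, and sample strings are made up for illustration:

```java
import java.util.regex.Pattern;

public class PunctuationClass {
    // Form used in the question: one alternative per punctuation mark.
    static final Pattern ALTERNATION =
        Pattern.compile("(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*");

    // Character-class form: '-' comes first so it is read as a literal.
    static final Pattern CHAR_CLASS = Pattern.compile("[-,/()\";.'<>:?]*");

    // True when both patterns agree on whether the whole string matches.
    static boolean sameVerdict(String s) {
        return ALTERNATION.matcher(s).matches() == CHAR_CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"", "-,/", "().;", "?<>:'\"", "a", "-a"};
        for (String s : samples) {
            System.out.println("[" + s + "] same verdict: " + sameVerdict(s));
        }
    }
}
```

One caveat: the alternation is a capturing group and the bare character class is not. If the replacement string refers to that group, as `$4` does in the question, keep parentheses around the class, for example `([-,/()\";.'<>:?]*)`, so the group numbers do not shift.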

0

Try this to catch the error

public class Example {
    public static void endless() {
        endless();
    }

    public static void main(String args[]) {
        try {
            endless();
        } catch(StackOverflowError t) {
            // more general: catch(Error t)
            // anything: catch(Throwable t)
            System.out.println("Caught "+t);
            t.printStackTrace();
        }
        System.out.println("After the error...");
    }
}

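More importantly, try increasing the size of the thread stack (on the JVM this is the -Xss option), or add a "+" after the quantifiers in your regex, for example *+ instead of *.

Adding the "+" symbol makes the quantifier possessive, which prevents backtracking; backtracking doesn't seem to be necessary in your case.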
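For reference, this is what the possessive form looks like in Java and how it differs from the greedy one. A minimal sketch with a made-up pattern and input, not the regexes from the question; the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Greedy: the repeated group may give back iterations if the rest fails.
    static boolean greedyMatches(String input) {
        return Pattern.compile("(?:&#\\d+;)+&#2346;x").matcher(input).matches();
    }

    // Possessive (++): the repeated group keeps everything it consumed.
    static boolean possessiveMatches(String input) {
        return Pattern.compile("(?:&#\\d+;)++&#2346;x").matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "&#2345;&#2346;x";
        // The greedy version backtracks one iteration and matches; the
        // possessive version has already consumed "&#2346;" and fails.
        System.out.println(greedyMatches(input) + " " + possessiveMatches(input));
        // prints "true false"
    }
}
```

As the output shows, a possessive quantifier can change what matches, so each one needs checking against real data. Whether it is enough to avoid the StackOverflowError depends on the pattern and the JVM.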

8 Comments

He should consider using the right tool for the job rather than treating the symptoms that result from using the wrong tool...
Chances are the overflow comes from recursion, not from the greediness of the regex. Making the operator possessive eliminates branching and recursive handling, which makes the expression more efficient and reduces memory usage.
I would either look for an RTF parsing library or write one myself. If I wrote one myself I would break up the parsing into small tasks rather than try to do everything at once. If I had to use regexes, I would keep them small and simple and make sure they only operate on small pieces of text. I would never consider feeding the entire document to a single, complicated regex.
It took about 5 seconds of googling to find this (maybe it will help, maybe it won't...)
Ok. Sorry if my comments were overly harsh. This whole "I must use regex" mentality is just so common on this site that it sometimes makes you want to scream from the top of the mountain: "not all problems must be solved with regex!"