Python Regex working different depending on the implementation?

Question

I'm working on a file parser that needs to cut out comments from JavaScript code. The catch is that it has to be smart enough not to treat a '//' sequence inside a string as the beginning of a comment. I have the following idea for doing it:

Iterate through the lines. Find the '//' sequence first, then find all strings surrounded by quotes (' or ") in the line, and then iterate through all the string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside all of them, it must be a proper comment beginning.

When testing the code on the following line (part of a bigger JS file, of course):

document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";

I ran into a problem. My regular expression code:

re_strings=re.compile("""   "
                            (?:
                            \\.|
                            [^\\"]
                            )*
                            "
                            |
                            '
                            (?:
                                [^\\']|
                                \\.
                            )*
                            '
                            """,re.VERBOSE);


for s in re.finditer(re_strings,line):
            print(s.group(0))

In Python 3.2.3 (and 3.1.4) this returns the following strings:

"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"

This is obviously wrong, because \" should not end the string. I've been debugging my regex for quite a long time and it SHOULDN'T end the string here. So I used RegexBuddy (with Python compatibility) and the Python regex tester at http://re-try.appspot.com/ for reference. The most peculiar thing is that they both return the same, correct results, which differ from my code's output:

"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"

My question is: what is the cause of these differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...

P.S. I know that checking whether the '//' sequence is inside string quotes can be accomplished with one bigger regex. I've already tried that and hit the same problem.

P.P.S. I would like to know what I'm doing wrong and why my code behaves differently from the regex test applications, not to find other ways to parse JavaScript code.

this is not going to work due to the nature of regex

Joran Beasley, commented Aug 30, 2012 at 22:58

Alan Moore · Accepted Answer · 2012-08-31 03:02:15Z

2

You just need to use a raw string to create the regex:

re_strings=re.compile(r"""   "
                             etc.
                             "
                        """,re.VERBOSE);

The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.

See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
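
The difference can be reproduced with a small sketch. It compiles the double-quote branch of the question's pattern twice, once from a normal string literal and once from a raw one, and runs both against the line from the question. The variable names (`plain_src`, `raw_src`, and so on) are made up for this illustration.

```python
import re

# Target line from the question; a raw string keeps its backslashes intact.
line = r'''document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";'''

# The same pattern text, first as a normal literal, then as a raw literal.
plain_src = """ " (?: \\. | [^\\"] )* " """
raw_src = r""" " (?: \\. | [^\\"] )* " """

# Python collapses each \\ in the normal literal before re ever sees it,
# so the regex engine receives \. and [^\"] instead of \\. and [^\\"].
print(plain_src)
print(raw_src)

plain_matches = [m.group(0) for m in re.finditer(plain_src, line, re.VERBOSE)]
raw_matches = [m.group(0) for m in re.finditer(raw_src, line, re.VERBOSE)]

print(len(plain_matches))  # 7 fragments, as in the question
for s in raw_matches:      # the 4 strings the regex testers reported
    print(s)
```

With the raw literal the regex engine itself sees `\\.`, meaning "a backslash followed by any character", so an escaped quote no longer ends the match.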

answered Aug 31, 2012 at 3:02

Alan Moore


1 Comment

Wookie88 Over a year ago

Wow, that's a really simple mistake I made. I thought that a triple-quoted string was already a raw string. Thanks for the link; this online tool could be pretty handy.

Community · Accepted Answer · 2017-05-23 11:49:06Z

1

You cannot deal with matching quotes with regex ... in fact you cannot guarantee any matching pairs of anything (and nested pairs especially) ... you need a more sophisticated state machine for that (LLVM, etc...)

source: lots of CS classes...

and also see: Matching pair tag with regex for a more detailed explanation

I know it's not what you wanted to hear, but it's basically just the way it is ... and yes, different implementations of regex can return different results for stuff that regex can't really do

edited May 23, 2017 at 11:49

CommunityBot


answered Aug 30, 2012 at 22:52

Joran Beasley


4 Comments

Wookie88 Over a year ago

I've also attended some CS classes, but maybe I forgot something. As far as I know a regex is a state machine; when it meets OR it tries to match the branches in left-to-right order (that is, in Python; I agree that it CAN vary between languages). So when you are looking for a quoted string you can write something like "(\\.|[^\\"])*" and it should work. I don't see any room for ambiguity in this regex: inside the string, if you meet \ it must be followed by some other character, so if that character is ", we are still in (...)*, which must be ended with ". Please correct me if I'm wrong.

Joran Beasley Over a year ago

I dunno I think you are right ... but all i remember from class about this was that you absolutely cannot match nested stuff(quotes were main example)... it was drilled into us pretty hard...

Vatine Over a year ago

Actually, you can say "one FOO", "followed by 0 or more of NO FOO", "followed by one FOO". But you cannot deal with nested things, and dealing with escaped foos gets really hairy, and what the question's asker actually wants is a full-fledged parser.

Wookie88 Over a year ago

Yeah, nesting with regex is quite hard and can take lots of resources. I previously ran into a problem where Python couldn't finish the findall method for a whole file, so I started iterating through lines and used a non-regex method for telling if ` // ` is outside quotes.
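
The comment does not show the non-regex method, so the following is only a minimal sketch of one way such a per-line check might look, not the commenter's actual code. The function name `find_comment_start` is hypothetical, and the sketch assumes strings never span lines and ignores block comments, regex literals and template literals.

```python
def find_comment_start(line):
    """Return the index of the first // outside a quoted string, or -1."""
    quote = None  # quote character of the string we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # closing quote reached
        elif ch in "\"'":
            quote = ch  # opening quote
        elif line.startswith("//", i):
            return i  # first // seen while outside any string
        i += 1
    return -1


js = r'''x="<a href=\"http://"+url+"\">"; // note'''
start = find_comment_start(js)
print(js[:start].rstrip())  # the line with its trailing comment removed
```

A scanner like this visits each character once, so it avoids the backtracking a regex alternation can incur on long inputs.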
