Python regex not returning as expected [duplicate]

Question

I am parsing some text with Python and am running into an odd issue...

an example text that is being parsed:

msg:"ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i"; reference:url,www.securityfocus.com/bid/37446/info; reference:url,doc.emergingthreats.net/2010602; classtype:web-application-attack; sid:2010602; rev:4; metadata:created_at 2010_07_30, updated_at 2010_07_30;

my regex:

msgSearch = re.search(r'msg:"(.+)";', line)

actual result:

ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i

expected result:

ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt

There are tens of thousands of lines of text that I am parsing, and they all give me similar results. Is there a reason the regex is picking a (seemingly) random "; to stop at? I can fix the example above by making the regex more specific, e.g. r'msg:"([\w\s\.]+)";', but other lines include different characters. I guess I could just include every special character in my regex, but I'm trying to understand why my wildcard isn't working properly.

Any help would be appreciated!

With your shown samples, please try the regex \bmsg:\"([^\"]+)\". An online demo is at regex101.com/r/U6ASno/1
– RavinderSingh13, Commented Jul 28, 2022 at 4:03
Does this answer your question? python regex first/shortest match
– outis, Commented Jul 28, 2022 at 4:41

Vasyl Moskalov · Accepted Answer · 2022-07-28 04:07:40Z

1

Try this one:

re.search(r'msg:"([^;]+)";',line)
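For illustration, here is a minimal sketch of the "everything except" approach wrapped in a helper. The name `extract_msg`, the compiled `MSG_RE`, and the shortened sample line are made up for this example and are not from the question. It uses the `[^"]+` variant from the comments, which assumes the message text never contains a double quote; the `[^;]+` pattern above instead assumes the message never contains a semicolon.

```python
import re

# Capture everything after msg:" up to the next double quote.
# Assumption: the message text itself never contains a double quote.
MSG_RE = re.compile(r'msg:"([^"]+)";')

def extract_msg(line):
    """Return the msg value from one rule line, or None if there is none."""
    match = MSG_RE.search(line)
    return match.group(1) if match else None

# Shortened, made-up stand-in for the rule line in the question.
sample = 'msg:"ET WEB_SPECIFIC_APPS XSS Attempt"; flow:established,to_server; content:"GET"; nocase;'

print(extract_msg(sample))               # the message text only
print(extract_msg('flow:established;'))  # no msg field on this line
```

Checking for a failed match before calling `.group(1)` matters when looping over many lines, since `re.search` returns `None` for any line without a `msg:` field.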


pdp8 · Accepted Answer · 2022-07-28 04:03:15Z

1

The .+ is by default "greedy", i.e. it will match as many characters as possible. In your case, it will stop at the last "; sequence, not at the next one. To make it non-greedy (or lazy), try .+? :

msgSearch = re.search(r'msg:"(.+?)";', line)
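To see the difference side by side, here is a small sketch that runs both patterns against the same input. The `line` value is a shortened, made-up stand-in for the rule in the question, not the original text.

```python
import re

# Shortened, made-up stand-in for the rule line in the question.
line = 'msg:"ET WEB_SPECIFIC_APPS XSS Attempt"; flow:established,to_server; content:"GET"; nocase;'

# Greedy: .+ grabs as much as it can, then backtracks to the LAST "; on the line.
greedy = re.search(r'msg:"(.+)";', line)

# Lazy: .+? grows one character at a time and stops at the FIRST "; it reaches.
lazy = re.search(r'msg:"(.+?)";', line)

print(greedy.group(1))  # runs on past the message, up to the final ";
print(lazy.group(1))    # just the message text
```

In the greedy case the stopping point is not random: it is the last place on the line where `";` occurs, which for this sample is right after `content:"GET`.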


jrod091 Over a year ago

thanks for the explanation about .+ being greedy! for some reason, (.+?) returned an empty string :( but the other examples of using the "everything except" regex (([^;]+) and ([^\"]+)) worked out.
