Combining two patterns with named capturing group in Python?

Question

I have a regular expression that uses a lookbehind assertion, like so:

>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:)([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:I118uailfriedx151201005423521">>')
>>> x.group('sid')
'I118uailfriedx151201005423521'

and another like so:

>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:<<")([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:<<"I118uailfriedx151201005423521')
>>> x.group('sid')
'I118uailfriedx151201005423521'

How can I combine these two patterns so that parsing these two different lines:

sid:A111uancalual2626x151130185758596
sid:<<"I118uailfriedx151201005423521">>

returns only the corresponding ID?

RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))') I'm not sure if this works, but could you try moving the sid: part out in front of the named group?
– zolo, Commented Feb 16, 2016 at 15:27
@zolo Your solution seems to work. Feel free to post it as an answer. I would appreciate a complete explanation, especially of the first part of your code, which I am not sure I understood. Why doesn't the first part have ?P?
– pm1359, Commented Feb 16, 2016 at 15:40

zolo · Accepted Answer · 2016-02-16 15:41:07Z

1

RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')

Use this; I've just tested it and it works for me. I moved the sid: prefix out of the named group.
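
As a quick check, here is a minimal sketch that runs the pattern above against the two sample lines from the question. The variable names lines and sids are only illustrative.

```python
import re

# sid: is matched literally, (<<")? is an optional prefix,
# and the ID itself lands in the named group "sid".
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')

lines = [
    'sid:A111uancalual2626x151130185758596',
    'sid:<<"I118uailfriedx151201005423521">>',
]

sids = [RE_SID.search(line).group('sid') for line in lines]
print(sids)
# ['A111uancalual2626x151130185758596', 'I118uailfriedx151201005423521']
```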


3 Comments

Wiktor Stribiżew Over a year ago

@MaryamPashmi: This (<<")? is an optional group. <<" can be there, or it can go missing, a match will still be found. BTW, I'd use a non-capturing group (in order not to interfere with the re.findall): r'sid:(?:<<")?(?P<sid>[A-Za-z0-9]+)'. However, what if you have >>" or >> or just > there, in between? My solution will work, this one won't.
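
To illustrate the re.findall point, a small sketch (the text variable simply joins the two sample lines from the question): with a capturing (<<")? group, findall returns one tuple per match, while the non-capturing (?:<<")? form returns just the IDs.

```python
import re

text = (
    'sid:A111uancalual2626x151130185758596\n'
    'sid:<<"I118uailfriedx151201005423521">>\n'
)

capturing = re.compile(r'sid:(<<")?(?P<sid>[A-Za-z0-9]+)')
non_capturing = re.compile(r'sid:(?:<<")?(?P<sid>[A-Za-z0-9]+)')

# Two groups in the pattern: findall yields (prefix, sid) tuples,
# with '' for the optional prefix when it did not take part in the match.
with_tuples = capturing.findall(text)
print(with_tuples)

# Only one group in the pattern: findall yields the sid strings directly.
only_sids = non_capturing.findall(text)
print(only_sids)
```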

pm1359 Over a year ago

@zolo It would be perfect if you could explain it clearly. I left a comment for you above. I don't get this part: sid:(<<")?(?P<sid> . In my solution, ?P<sid> is just a name so that I can refer to the group later.

zolo Over a year ago

I've seen the (?P<variable_name>.*) style before in another place (Splunk). The construct just names a part of the match. This way you can use several named groups one after another, and you don't have to put every part of the line into a group. So if you want to keep the "sid:" part inside the group definition, you have to use a look-behind assertion, which (in most languages) makes it impossible to use "?", "*", "+" and the like inside it. It looked easier to move the sid part out, and then I can even use "?". I didn't change or optimize your regexp in any other way, however.
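
A rough sketch of the look-behind restriction, assuming Python's standard re module: an optional part inside the look-behind makes its width variable, and compiling fails. Two workarounds are shown, neither taken verbatim from the thread: alternating two fixed-width look-behinds, and consuming the prefix outside the named group. The names compiled, either and outside are illustrative.

```python
import re

# Making the <<" part optional inside the look-behind gives it a variable
# width, which the re module rejects at compile time.
try:
    re.compile(r'(?P<sid>(?<=sid:(?:<<")?)[A-Za-z0-9]+)')
    compiled = True
except re.error:
    compiled = False
print(compiled)

# Workaround 1: two fixed-width look-behinds joined with "|".
either = re.compile(r'(?P<sid>(?:(?<=sid:<<")|(?<=sid:))[A-Za-z0-9]+)')

# Workaround 2: consume the prefix outside the named group.
outside = re.compile(r'sid:(?:<<")?(?P<sid>[A-Za-z0-9]+)')

results = []
for line in ('sid:A111uancalual2626x151130185758596',
             'sid:<<"I118uailfriedx151201005423521">>'):
    results.append((either.search(line).group('sid'),
                    outside.search(line).group('sid')))
print(results)
```

Both workarounds return the same ID for each sample line; the second is the shorter pattern and is the approach used in the answer above.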

alecxe · Accepted Answer · 2016-02-16 15:34:03Z

0

Instead of tweaking your regex, you can make your strings easier to parse by just removing any characters except alphanumeric and a colon. Then, just split by colon and get the last item:

>>> import re
>>> 
>>> test_strings = ['sid:I118uailfriedx151201005423521">>', 'sid:<<"I118uailfriedx151201005423521']
>>> pattern = re.compile(r"[^A-Za-z0-9:]")
>>> for test_string in test_strings:
...     print(pattern.sub("", test_string).split(":")[-1])
... 
I118uailfriedx151201005423521
I118uailfriedx151201005423521


3 Comments

pm1359 Over a year ago

I would really prefer to keep my own solution and put an 'or' between my patterns. I have a log of 10,000 lines and I need to parse lots of information, so I am not sure your solution would work for me.

alecxe Over a year ago

@MaryamPashmi Why are you "not sure"? Given the provided input samples, it works, and I think the solution is quite simple.

alecxe Over a year ago

@MaryamPashmi But yeah, if you want to run findall against the complete log file, then you should probably go with a single regular expression. Thanks.

Wiktor Stribiżew · Accepted Answer · 2016-02-16 15:56:17Z

0

You can achieve what you want with a single regex:

\bsid:\W*(?P<sid>\w+)

See the regex demo

The regex breakdown:

\bsid - whole word sid
: - a literal colon
\W* - zero or more non-word characters
(?P<sid>\w+) - one or more word characters captured into a group named "sid"

Python demo:

import re
p = re.compile(r'\bsid:\W*(?P<sid>\w+)')
#test_str = "sid:I118uailfriedx151201005423521\">>" # => I118uailfriedx151201005423521
test_str = "sid:<<\"I118uailfriedx151201005423521" # => I118uailfriedx151201005423521
m = p.search(test_str)
if m:
    print(m.group("sid"))

edited Feb 16, 2016 at 15:56

answered Feb 16, 2016 at 15:36


8 Comments

Wiktor Stribiżew Over a year ago

This approach will work even if you have these sids inside a larger text.

pm1359 Over a year ago

I really want to use named capturing group and I would like to be able to refer to it later on.

Wiktor Stribiżew Over a year ago

I revamped with a named capture group.

pm1359 Over a year ago

Sometimes the log includes sid:<<"I145ucaat.kingx151130155814194">>, with a dot in the middle.

Wiktor Stribiżew Over a year ago

Is it the only non-word character that can be inside the sid? Use \bsid:\W*(?P<sid>\w+(?:\.\w+)*). If you know the exact trailing boundary, you can also match the sid with .*?.
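
A minimal sketch of the dotted variant, run against the two sample lines from the question plus the dotted one from the comment above (the names samples and sids are illustrative):

```python
import re

# \w+ matches the first run of word characters; (?:\.\w+)* then allows
# any number of ".more" segments to be part of the same sid.
RE_SID = re.compile(r'\bsid:\W*(?P<sid>\w+(?:\.\w+)*)')

samples = [
    'sid:A111uancalual2626x151130185758596',
    'sid:<<"I118uailfriedx151201005423521">>',
    'sid:<<"I145ucaat.kingx151130155814194">>',
]

sids = [RE_SID.search(s).group('sid') for s in samples]
print(sids)
```

Note that \w also matches the underscore, so this pattern is slightly more permissive than the [A-Za-z0-9] class used in the question.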
