
I have a regular expression that uses a lookbehind assertion, like so:

>>> import re
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:)([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:I118uailfriedx151201005423521">>')
>>> x.group('sid')
'I118uailfriedx151201005423521'

and another like so:

>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:<<")([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:<<"I118uailfriedx151201005423521')
>>> x.group('sid')
'I118uailfriedx151201005423521'

How can I combine these two patterns so that parsing these two different lines:

sid:A111uancalual2626x151130185758596
sid:<<"I118uailfriedx151201005423521">>

returns only the corresponding ID in each case?

  • RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))') I'm not sure whether this works, but could you try moving the sid: part out in front of the named group? Commented Feb 16, 2016 at 15:27
  • @zolo Your solution seems to work. If you would like to post it as an answer, feel free to do so. I would appreciate a complete explanation, especially of the first part of your code, which I am not sure I understood: why doesn't the first part have ?P? Commented Feb 16, 2016 at 15:40

3 Answers


RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')

Use this; I've just tested it and it works for me. I moved the sid: prefix out of the named group and made the <<" part optional.
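
As a quick check, here is a minimal sketch that runs the pattern against the two sample lines from the question. It uses the non-capturing variant of the optional prefix, (?:<<")?, which is an assumption on top of the answer's pattern; the list name lines is only illustrative.

```python
import re

# Optional <<" prefix in a non-capturing group; only "sid" is captured.
RE_SID = re.compile(r'sid:(?:<<")?(?P<sid>[A-Za-z0-9]+)')

# The two sample lines from the question.
lines = [
    'sid:A111uancalual2626x151130185758596',
    'sid:<<"I118uailfriedx151201005423521">>',
]

sids = []
for line in lines:
    m = RE_SID.search(line)
    if m:  # skip lines that contain no sid
        sids.append(m.group('sid'))

print(sids)
# ['A111uancalual2626x151130185758596', 'I118uailfriedx151201005423521']
```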


3 Comments

@MaryamPashmi: (<<")? is an optional group: <<" can be present or absent, and a match will still be found. BTW, I'd use a non-capturing group (so that it doesn't interfere with re.findall): r'sid:(?:<<")?(?P<sid>[A-Za-z0-9]+)'. However, what if you have >>" or >> or just > in between? My solution will work; this one won't.
@zolo A clear explanation would be perfect. I left a comment for you above. I don't get this part: sid:(<<")?(?P<sid> . In my solution, ?P<sid> is just a name that lets me refer to the group later.
I've seen the (?P<variable_name>.*) style before in another place (Splunk). It simply gives a name to part of the match. That way you can use several named groups one after another, and you don't have to put every part of the line into a group. So if you want to keep the "sid:" part inside the group definition, you have to use a lookbehind assertion, and in most languages a lookbehind cannot contain "?", "*", "+" or similar quantifiers. It looked easier to move the sid part out, and then I can even use "?". I didn't change or optimize your regexp in any other way, however.
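
To illustrate the lookbehind limitation mentioned in the last comment, here is a small sketch, assuming Python's built-in re module. It shows that a lookbehind containing an optional part is rejected, while two fixed-width lookbehinds joined with "|" (closer to the "or" the question asks about) are accepted. The names RE_SID_LB, variable_width_ok and results are made up for this example.

```python
import re

# A lookbehind with an optional part has variable width,
# which Python's re module refuses to compile.
try:
    re.compile(r'(?<=sid:(?:<<")?)[A-Za-z0-9]+')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Each lookbehind below has a fixed width, so alternating them is allowed.
RE_SID_LB = re.compile(r'(?:(?<=sid:)|(?<=sid:<<"))(?P<sid>[A-Za-z0-9]+)')

results = [
    RE_SID_LB.search(line).group('sid')
    for line in (
        'sid:A111uancalual2626x151130185758596',
        'sid:<<"I118uailfriedx151201005423521">>',
    )
]

print(variable_width_ok)  # False
print(results)
```

Moving sid: out of the lookbehind, as in the answer, avoids the restriction altogether and is usually easier to read.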

Instead of tweaking your regex, you can make your strings easier to parse by just removing any characters except alphanumeric and a colon. Then, just split by colon and get the last item:

>>> import re
>>> 
>>> test_strings = ['sid:I118uailfriedx151201005423521">>', 'sid:<<"I118uailfriedx151201005423521']
>>> pattern = re.compile(r"[^A-Za-z0-9:]")
>>> for test_string in test_strings:
...     print(pattern.sub("", test_string).split(":")[-1])
... 
I118uailfriedx151201005423521
I118uailfriedx151201005423521

3 Comments

I really prefer my approach of using 'or' between my patterns. I have a log of 10000 lines and need to parse a lot of information, so I am not sure whether your solution would work for me.
@MaryamPashmi Why are you "not sure"? Given the provided input samples, it works, and I think the solution is quite simple.
@MaryamPashmi But yeah, if you want to run findall against the complete log file, then you should probably go with a single regular expression. Thanks.

You can achieve what you want with a single regex:

\bsid:\W*(?P<sid>\w+)

See the regex demo

The regex breakdown:

  • \bsid - the text sid preceded by a word boundary, so it is not matched at the end of a longer word
  • : - a literal colon
  • \W* - zero or more non-word characters
  • (?P<sid>\w+) - one or more word characters captured into a group named "sid"

Python demo:

import re
p = re.compile(r'\bsid:\W*(?P<sid>\w+)')
#test_str = "sid:I118uailfriedx151201005423521\">>" # => I118uailfriedx151201005423521
test_str = "sid:<<\"I118uailfriedx151201005423521" # => I118uailfriedx151201005423521
m = p.search(test_str)
if m:
    print(m.group("sid"))

8 Comments

This approach will work even if you have these sids inside a larger text.
I really want to use named capturing group and I would like to be able to refer to it later on.
I revised the answer to use a named capture group.
Sometimes the log includes sid:<<"I145ucaat.kingx151130155814194">>, with a dot in the middle.
Is the dot the only non-word character that can appear inside the sid? If so, use \bsid:\W*(?P<sid>\w+(?:\.\w+)*). If you know the exact trailing boundary, you can also match the sid with .*?.
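
A minimal sketch of the dotted variant from the comment above, run over a small multi-line string with finditer. The log variable is a hypothetical stand-in for the real log file, built from the sample lines quoted in this thread.

```python
import re

# \w+(?:\.\w+)* allows dot-separated word chunks inside the sid.
RE_SID = re.compile(r'\bsid:\W*(?P<sid>\w+(?:\.\w+)*)')

# Hypothetical log excerpt assembled from the samples in this thread.
log = '''sid:A111uancalual2626x151130185758596
sid:<<"I118uailfriedx151201005423521">>
sid:<<"I145ucaat.kingx151130155814194">>'''

sids = [m.group('sid') for m in RE_SID.finditer(log)]
for sid in sids:
    print(sid)
# A111uancalual2626x151130185758596
# I118uailfriedx151201005423521
# I145ucaat.kingx151130155814194
```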