simple regex for following string

Question

I have a string which looks like

rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036

Now, what I want to do is

extract timestamp: 1340496000
        event: EP002960010145

Now the issue is that there is a %3D after tmsid, and I don't even know what it is. Sometimes it's %3D, sometimes %6D, and I think it can even be %16D, but I can't be sure about that.

Is there a robust way to handle these two fields from the above string?

Thanks

Martijn Pieters · Accepted Answer · 2013-03-27 20:49:20Z

You are looking at URL-quoted data:

>>> from urllib2 import unquote
>>> unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
'rand_id:?tmsid=1340496000_EP002960010145_11_0_10050_1_2_10036'
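The session above is Python 2. On Python 3 the same function lives in urllib.parse; a minimal sketch, assuming nothing beyond the standard library (the variable names are just for illustration):

```python
from urllib.parse import unquote

raw = 'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'

# %3A, %3F and %3D are the percent-encoded forms of ':', '?' and '='.
unquoted = unquote(raw)
print(unquoted)
```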

You can split on the first = perhaps, then split on _:

>>> unquoted = unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
>>> unquoted.split('=', 1)[1].split('_')
['1340496000', 'EP002960010145', '11', '0', '10050', '1', '2', '10036']
>>> timestamp, event = unquoted.split('=', 1)[1].split('_')[:2]
>>> timestamp, event
('1340496000', 'EP002960010145')
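Since the question asks for a regex, here is a rough equivalent of the split approach, written for Python 3. The helper name extract_fields and the pattern are assumptions: the pattern presumes the timestamp is all digits and the event id is a run of letters and digits, which matches the sample but may not hold for all of your data.

```python
import re
from urllib.parse import unquote

# Assumed shape: "tmsid=", digits (timestamp), "_", letters/digits (event).
TMSID_RE = re.compile(r'tmsid=(\d+)_([A-Za-z0-9]+)')


def extract_fields(raw):
    """Return (timestamp, event) from a URL-quoted string, or None if no match."""
    # Unquote first, so the pattern never has to deal with %3D and friends.
    match = TMSID_RE.search(unquote(raw))
    if match is None:
        return None
    return match.group(1), match.group(2)


raw = 'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'
print(extract_fields(raw))
```

Returning None on a failed match lets the caller decide how to treat strings that do not contain a tmsid field.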

If the data can contain multiple fields separated by &, it is probably better to parse everything after the question mark as a URL query string, using urlparse.parse_qs():

>>> from urlparse import parse_qs
>>> parse_qs(unquoted.split('?', 1)[1])
{'tmsid': ['1340496000_EP002960010145_11_0_10050_1_2_10036']}
>>> parsed = parse_qs(unquoted.split('?', 1)[1])
>>> timestamp, event = parsed['tmsid'][0].split('_', 2)[:2]
>>> timestamp, event
('1340496000', 'EP002960010145')
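To illustrate the multiple-field case, here is a Python 3 sketch of the same idea, where parse_qs is imported from urllib.parse. The trailing foo=bar field is invented for the example and is not part of the original string.

```python
from urllib.parse import parse_qs, unquote

# Hypothetical input: the original string with an invented "&foo=bar" field
# appended (percent-encoded as %26foo%3Dbar).
raw = ('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'
       '%26foo%3Dbar')

# Everything after the first '?' is treated as the query string.
query = unquote(raw).split('?', 1)[1]
parsed = parse_qs(query)

# parse_qs maps each key to a list of values; take the first tmsid value.
timestamp, event = parsed['tmsid'][0].split('_', 2)[:2]
print(parsed)
print(timestamp, event)
```

Note that unquoting the whole string before parsing assumes the field values themselves never contain an encoded & or =.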
