I have a string that looks like this:

rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036 

Now, what I want to do is

extract timestamp: 1340496000
        event: EP002960010145

The issue is that there is a %3D after tmsid, and I don't know what it is. Sometimes it is %3D, sometimes %6D, and I think it can even be %16D, but I can't be sure about that.

Is there a robust way to handle these two fields from the above string?

Thanks

1 Answer

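You are looking at URL-quoted (percent-encoded) data. %3D is the encoded form of =, just as %3A is : and %3F is ?. Unquoting the string makes the structure visible: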

>>> from urllib2 import unquote
>>> unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
'rand_id:?tmsid=1340496000_EP002960010145_11_0_10050_1_2_10036'

You could then split on the first =, and split the remainder on _:

>>> unquoted = unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
>>> unquoted.split('=', 1)[1].split('_')
['1340496000', 'EP002960010145', '11', '0', '10050', '1', '2', '10036']
>>> timestamp, event = unquoted.split('=', 1)[1].split('_')[:2]
>>> timestamp, event
('1340496000', 'EP002960010145')
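
The session above uses Python 2 module names. As a minimal sketch for Python 3, assuming the same string layout as in the question, the same steps could be wrapped in a function. The name extract_timestamp_and_event is hypothetical and used only for illustration; unquote lives in urllib.parse in Python 3.

```python
from urllib.parse import unquote


def extract_timestamp_and_event(raw):
    """Return (timestamp, event) from a percent-encoded string.

    Hypothetical helper: assumes the payload follows the first '='
    and that its first two '_'-separated fields are the timestamp
    and the event id, as in the question's sample string.
    """
    decoded = unquote(raw)
    # partition() returns an empty separator if '=' is missing,
    # which lets us fail with a clear message instead of an IndexError.
    _, sep, payload = decoded.partition('=')
    if not sep:
        raise ValueError('no "=" found in {!r}'.format(decoded))
    fields = payload.split('_')
    if len(fields) < 2:
        raise ValueError('expected at least two fields in {!r}'.format(payload))
    return fields[0], fields[1]


timestamp, event = extract_timestamp_and_event(
    'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
print(timestamp, event)
```

Both values come back as strings; convert the timestamp with int() if you need a number.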

If the data can contain multiple fields separated by &, it is probably better to parse everything after the question mark as a URL query string, using urlparse.parse_qs():

>>> from urlparse import parse_qs
>>> parse_qs(unquoted.split('?', 1)[1])
{'tmsid': ['1340496000_EP002960010145_11_0_10050_1_2_10036']}
>>> parsed = parse_qs(unquoted.split('?', 1)[1])
>>> timestamp, event = parsed['tmsid'][0].split('_', 2)[:2]
>>> timestamp, event
('1340496000', 'EP002960010145')
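
A rough Python 3 equivalent of the query-string approach, where parse_qs is imported from urllib.parse. This is a sketch under assumptions: the function name parse_tmsid is hypothetical, and the key is assumed to always be tmsid as in the sample string.

```python
from urllib.parse import parse_qs, unquote


def parse_tmsid(raw):
    """Return (timestamp, event) taken from the 'tmsid' query field.

    Hypothetical helper: assumes the query string starts after the
    first '?' and that 'tmsid' holds '_'-separated fields.
    """
    decoded = unquote(raw)
    # Keep only the part after the first '?'; empty if there is none.
    _, _, query = decoded.partition('?')
    params = parse_qs(query)
    values = params.get('tmsid')
    if not values:
        raise ValueError('no tmsid field in {!r}'.format(decoded))
    # Only the first two fields are needed; the rest stays unsplit.
    timestamp, event = values[0].split('_', 2)[:2]
    return timestamp, event


# %26 is an encoded '&', so this sample carries a second field (foo=bar).
print(parse_tmsid(
    'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0%26foo%3Dbar'))
```

Because the lookup is by key, extra fields before or after tmsid do not affect the result.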