Extracting data using regex in Python

Question

I have a string variable whose data looks something like this:

a:15:{s:6:"status";s:6:"Active";s:9:"checkdate";s:8:"20130807";s:11:"companyname";s:4:"test";s:11:"validdomain";s:19:"test";s:7:"md5hash";s:32:"501yd361fe10644ea1184412c3e89dce";s:7:"regdate";s:10:"2013-08-06";s:14:"registeredname";s:10:"TestName";s:9:"serviceid";s:1:"8";s:11:"nextduedate";s:10:"0000-00-00";s:12:"billingcycle";s:8:"OneTime";s:7:"validip";s:15:"xxx.xxx.xxx.xxx";s:14:"validdirectory";s:5:"/root";s:11:"productname";s:20:"SomeProduct";s:5:"email";s:19:"[email protected]";s:9:"productid";s:1:"1";}

I am trying to extract the quoted data into a dictionary as a key-value pair like so:

{"status":"Active","checkdate":20130807,.............}

I tried extracting it using the following:

tempkeyresults = re.findall('"(.*?)"([^"]+)</\\1>', localdata, flags=re.IGNORECASE)

I'm quite new to regex and I assume what I am trying to query translates to "find and extract all data between " and " and extract it before the next "...". However, this returns an empty list ([]). Could someone tell me where I am wrong?

Thanks in advance

Do you have any types other than s (presumably for 'string')? Do you need to validate that the lengths are correct? How are embedded double quotes in the values handled? Is this a standard format? If so, is there not a standard package to extract the information?
– Jonathan Leffler, Commented Aug 7, 2013 at 7:43
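
As for why the pattern in the question finds nothing: it ends in </\1>, which requires a literal closing tag named after the first quoted text, and the data contains no such tag. A minimal sketch of that behaviour follows; the two input strings are made up for illustration and are not the asker's data.

```python
import re

# The pattern from the question, written as a raw string.
pattern = r'"(.*?)"([^"]+)</\1>'

# Hypothetical input that does contain a closing tag named after the quoted text.
tagged = '"status"Active</status>'
tagged_matches = re.findall(pattern, tagged)
print(tagged_matches)      # [('status', 'Active')]

# A fragment in the shape of the serialized data: there is no "</...>" anywhere,
# so the trailing </\1> can never match.
serialized = 's:6:"status";s:6:"Active";'
serialized_matches = re.findall(pattern, serialized)
print(serialized_matches)  # []
```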

Community · Accepted Answer · 2017-05-23 12:20:23Z

2

How about this?

>>> import re
>>> s = 'a:15:{s:6:"status";s:6:"Active";s:9:"checkdate";s:8:"20130807";s:11:"companyname";s:4:"test";s:11:"validdomain";s:19:"test";s:7:"md5hash";s:32:"501yd361fe10644ea1184412c3e89dce";s:7:"regdate";s:10:"2013-08-06";s:14:"registeredname";s:10:"TestName";s:9:"serviceid";s:1:"8";s:11:"nextduedate";s:10:"0000-00-00";s:12:"billingcycle";s:8:"OneTime";s:7:"validip";s:15:"xxx.xxx.xxx.xxx";s:14:"validdirectory";s:5:"/root";s:11:"productname";s:20:"SomeProduct";s:5:"email";s:19:"[email protected]";s:9:"productid";s:1:"1";}'
>>> results = re.findall(r'"(\w+)"', s)
>>> dict(zip(*[iter(results)] * 2))
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

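\w means "any word character": letters (either case), digits, and the underscore (_). It does not match characters such as -, ., / or @, so quoted values like 2013-08-06 are skipped and the pairs after them shift, as the output above shows.
+ means 1 or more.
dict(zip(*[iter(results)] * 2)) groups the flat list into consecutive pairs: the same iterator is passed to zip twice, so each tuple consumes two successive items, and dict turns those (key, value) tuples into a dictionary.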
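
The pairing idiom may be easier to follow on a small list. This is only an illustrative sketch; the list contents and variable names are made up.

```python
# Hypothetical flat list of alternating keys and values.
flat = ['status', 'Active', 'checkdate', '20130807']

it = iter(flat)
# zip(it, it) pulls from the same iterator twice per tuple,
# which is what zip(*[iter(flat)] * 2) does in one expression.
pairs = list(zip(it, it))
print(pairs)        # [('status', 'Active'), ('checkdate', '20130807')]
print(dict(pairs))  # {'status': 'Active', 'checkdate': '20130807'}

# Caveat: with an odd number of items, the trailing item is silently dropped.
odd = dict(zip(*[iter(['a', 'b', 'c'])] * 2))
print(odd)          # {'a': 'b'}
```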

edited May 23, 2017 at 12:20

CommunityBot


answered Aug 7, 2013 at 7:43

alecxe


4 Comments

rahuL Over a year ago

what does (\w+) stand for?

rahuL Over a year ago

And could you explain zip(*[iter(results)] * 2). I am seeing this for the first time..

alecxe Over a year ago

Edited the post, please check.

rahuL Over a year ago

Thank you. It has been of great help!

zhangyangyu · Accepted Answer · 2013-08-07 07:57:31Z

1

This one finds all the words surrounded by quotes and then pairs up the list into a mapping:

>>> res = re.findall(r'"(\w+)"', s)
>>> i = iter(res)
>>> dict(zip(*[i]*2))
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

Or use this one, which lets the regex itself find each pair of adjacent quoted words:

>>> res = re.findall(r'"(\w+)"(?:.*?)"(\w+)"', s)
>>> res
[('status', 'Active'), ('checkdate', '20130807'), ('companyname', 'test'), ('validdomain', 'test'), ('md5hash', '501yd361fe10644ea1184412c3e89dce'), ('regdate', 'registeredname'), ('TestName', 'serviceid'), ('8', 'nextduedate'), ('billingcycle', 'OneTime'), ('validip', 'validdirectory'), ('productname', 'SomeProduct'), ('email', 'productid')]
>>> dict(res)
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}
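
Note that the outputs above pair 'regdate' with 'registeredname': \w+ cannot match values that contain -, ., / or @, so those quoted strings are skipped and the following pairs shift. A possible variant, sketched under the assumption that no value contains an embedded double quote, matches any run of non-quote characters instead. The input here is a shortened, made-up fragment in the same shape as the question's data.

```python
import re

# Shortened, hypothetical fragment in the same shape as the data in the question.
s = 's:7:"regdate";s:10:"2013-08-06";s:14:"validdirectory";s:5:"/root";s:9:"serviceid";s:1:"8";'

# [^"]* accepts any characters except a double quote, so hyphens,
# dots and slashes are kept. Assumes values never contain a '"'.
# The ;s:<digits>: part ties each key to the value that directly follows it.
pairs = re.findall(r'"([^"]*)";s:\d+:"([^"]*)"', s)
result = dict(pairs)
print(result)
# {'regdate': '2013-08-06', 'validdirectory': '/root', 'serviceid': '8'}
```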

edited Aug 7, 2013 at 7:57

answered Aug 7, 2013 at 7:44

zhangyangyu


5 Comments

rahuL Over a year ago

what does (\w+) stand for?

zhangyangyu Over a year ago

\w means [a-zA-Z0-9_]. + means at least one.

zhangyangyu Over a year ago

I posted another kind of solution. @i.h4d35

rahuL Over a year ago

One more doubt: if the words have hyphens (-) or commas in between, how would we include that in the expression?

zhangyangyu Over a year ago

Replace \w with [\w.-] (or [\w.,-] to allow commas as well).

Brigand · Accepted Answer · 2013-08-07 08:13:08Z

1

You can do it without regular expressions.

parts = s.split('"')[1::2] # get all quoted text in a list
keys, values = parts[::2], parts[1::2] # take even and odd items (keys, values)
results = dict(zip(keys, values)) # turn it into a dict

results:

{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'productid': '1', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': '2013-08-06', 'registeredname': 'TestName', 'email': '[email protected]', 'serviceid': '8', 'nextduedate': '0000-00-00', 'billingcycle': 'OneTime', 'validip': 'xxx.xxx.xxx.xxx', 'productname': 'SomeProduct', 'checkdate': '20130807', 'validdirectory': '/root'}
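
Like the regex approaches, splitting on '"' assumes that no value contains a double quote. The data looks like PHP's serialize() output, where each string carries its length, so another option is to read each string by its declared length. The following is only a sketch under stated assumptions: parse_string_pairs is a hypothetical helper, it handles string items only, and it assumes ASCII text so that the declared byte length equals the character count. It uses its own small sample because the lengths in the question's (anonymized) data do not all match the values.

```python
import re

def parse_string_pairs(data):
    """Parse s:<len>:"<text>"; items by length and pair them up.

    Hypothetical helper: handles only string items, and assumes ASCII
    text so that the declared byte length equals the character count.
    """
    header = re.compile(r's:(\d+):"')
    items = []
    pos = data.index('{') + 1          # skip the a:<count>:{ prefix
    while True:
        m = header.match(data, pos)
        if m is None:                  # no further string item
            break
        start = m.end()
        end = start + int(m.group(1))  # trust the declared length
        if data[end:end + 2] != '";':
            raise ValueError('length mismatch at offset %d' % start)
        items.append(data[start:end])
        pos = end + 2
    # Even positions are keys, odd positions are values.
    return dict(zip(items[::2], items[1::2]))

# Hypothetical sample; the first value contains a double quote and a semicolon.
sample = 'a:2:{s:4:"name";s:9:"say "hi";";s:2:"id";s:1:"8";}'
parsed = parse_string_pairs(sample)
print(parsed)
# {'name': 'say "hi";', 'id': '8'}
```

Because the value is sliced by length rather than searched for a closing quote, embedded quotes and semicolons do not confuse the pairing. For real-world data with other item types, a dedicated parser for the format would likely be more robust than this sketch.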

answered Aug 7, 2013 at 8:13

Brigand

