Regex to extract multiple fields from pattern

Question

I have a pattern like this in a txt file:

["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]

and I need a regex to extract each field in Python. Every field can contain any character (not only alphanumeric) except for the 4th, which is a long number. How can I do it? Many thanks.

EDIT: the file contains other HTML elements, which is why I can't parse it directly into a Python list.

Is that a literal copy-paste of what's in your text file? You could just ast.literal_eval it since it is already a valid Python list of strings.
– Cory Kramer, Commented Oct 20, 2014 at 14:15
It is, but the file is composed of other text. I can't simply import it as a list.
– phcaze, Commented Oct 20, 2014 at 14:15
That looks like it's either JSON or Python 3 syntax (based on the unicode escape in the string). So use json.loads or ast.literal_eval respectively.
– interjay, Commented Oct 20, 2014 at 14:19

Noctis Skytower · Accepted Answer · 2014-10-20 14:33:18Z

The following provides three different options for getting your data:

>>> TEXT = '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]'
>>> import json, ast, re
>>> json.loads(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> ast.literal_eval(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> re.search(r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]', TEXT).groupdict()
{'website': 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'number2': '1409770337', 'language': 'es', 'data': 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 'number1': '116224357500406255237', 'name': 'kiarix moreno'}
>>>

In particular, your regular expression would be the following: r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]'
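
Since the question's edit says the list sits among other HTML, here is a minimal sketch of applying the same pattern with re.finditer to a larger string. The sample markup, the names PATTERN, SAMPLE and extract_records, and the int conversion of number2 are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once and split over two lines.
PATTERN = re.compile(
    r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",'
    r'(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]'
)


def extract_records(text):
    """Return one dict per matching list found anywhere in text."""
    records = []
    for match in PATTERN.finditer(text):
        record = match.groupdict()
        # number2 is matched as digits only; convert it when it is non-empty.
        if record["number2"]:
            record["number2"] = int(record["number2"])
        records.append(record)
    return records


# Hypothetical file content: the list embedded in unrelated markup.
# The doubled backslash keeps \u003d as literal text, as it would be in the file.
SAMPLE = (
    '<div class="x">junk</div>\n'
    '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",'
    '1409770337,"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]\n'
    '<p>more junk</p>'
)

records = extract_records(SAMPLE)
for record in records:
    print(record["name"], record["number2"], record["language"])
```

Note that the regex returns the raw text of each field, so an escape such as \u003d stays exactly as written in the file, and a field containing an escaped double quote would not match this pattern. If decoded values are needed, passing the matched span to json.loads is one option.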

vks · Accepted Answer · 2014-10-20 14:18:31Z

"([^"]*)"|(\d+)

You can try this. Grab the matches: group 1 holds a quoted field without its quotes, and group 2 holds the number. See the demo.

http://regex101.com/r/dK1xR4/5
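
As a rough sketch of "grab the matches" in Python, the snippet below runs an alternation of this shape (with the closing quote outside the capture group) through re.findall on the sample line. The names FIELD, LINE and extract_fields are hypothetical, and converting the numeric field to int is an added assumption.

```python
import re

# Either a double-quoted string (captured without its quotes) or a run of digits.
FIELD = re.compile(r'"([^"]*)"|(\d+)')

# The sample line from the question; \\u003d keeps the escape as literal text.
LINE = (
    '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",'
    '1409770337,"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]'
)


def extract_fields(line):
    """Return the fields of one list line, in order."""
    fields = []
    # With two groups, findall yields (quoted, digits) tuples;
    # the group that did not take part in the match is an empty string.
    for quoted, digits in FIELD.findall(line):
        fields.append(int(digits) if digits else quoted)
    return fields


fields = extract_fields(LINE)
print(fields)
```

Applied to the whole file, this pattern would also pick up quoted HTML attributes and stray digits, so it is probably best run only on text already known to be the list.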


It is but the file is composed of other text.
– Avinash Raj

rahul tyagi · Accepted Answer · 2014-10-20 14:21:34Z

You can 1) open the file, 2) read it line by line (for example by iterating over the file object), and 3) use the split() function to split each line on "," and then use the resulting list however you want.
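
The three steps above can be sketched roughly as follows. The list LINES stands in for iterating over an open file, and the names LINES and split_fields, along with the bracket check used to skip non-list lines, are assumptions for illustration.

```python
def split_fields(line):
    """Naively split one '[...]' line on commas and strip the quotes.

    Returns None for lines that are not wrapped in square brackets.
    """
    inner = line.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        return None  # not a list line; skip it
    parts = inner[1:-1].split(",")
    return [part.strip().strip('"') for part in parts]


# Hypothetical file content, one entry per line of the file.
LINES = [
    "<p>unrelated markup</p>",
    '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",'
    '1409770337,"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]',
]

rows = []
for line in LINES:
    fields = split_fields(line)
    if fields is not None:
        rows.append(fields)

print(rows)
```

This only works while no field contains a comma, and every value comes back as a string (including the number), which is why the regex or literal_eval approaches are usually more robust here.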


Anzel · Accepted Answer · 2014-10-20 14:38:57Z

I'm going to combine re, try/except, and ast.literal_eval, and read the whole file at once, to pick up every possible list. A [ ] pair may span several lines, so reading with readline won't work.

Here is my solution:

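import ast
import re

# Read the whole file at once so a list may span several lines.
with open('yourfile.txt', 'r') as f:
    text = f.read()

# Grab every candidate "[...]" span. Non-greedy with DOTALL, so each match
# is the shortest bracketed span and may cross line breaks.
found = re.findall(r'\[.*?\]', text, re.DOTALL)

for each in found:
    try:
        for el in ast.literal_eval(each):
            print(el)
    except (SyntaxError, ValueError):
        # Not a valid Python literal (for example, brackets in the HTML); skip it.
        pass

Output:

kiarix moreno
116224357500406255237
z120gbkosz2oc3ckv23bc10hhwrudlcjy04
1409770337
com.youtube.www/watch?v=p1JPKLa-Ofc:https
es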
