1

I have a pattern like this in a txt file:

["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]

and I need a regex to extract each field in Python. Every field can contain any character (not only alphanumeric ones), except for the 4th, which is a long number. How can I do it? Many thanks.

EDIT: the file also contains other HTML elements, which is why I can't parse it directly into a Python list.

4
  • It seems like a list. Commented Oct 20, 2014 at 14:14
  • 2
    Is that a literal copy-paste of what's in your text file? You could just ast.literal_eval it since it is already a valid python list of strings. Commented Oct 20, 2014 at 14:15
  • It is, but the file is composed of other text as well. I can't simply import it as a list. Commented Oct 20, 2014 at 14:15
  • That looks like it's either JSON or Python 3 syntax (based on the unicode escape in the string). So use json.loads or ast.literal_eval respectively. Commented Oct 20, 2014 at 14:19

4 Answers

1

The following provides three different options for getting your data:

>>> TEXT = '["kiarix moreno","116224357500406255237","z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,"com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https","es"]'
>>> import json, ast, re
>>> json.loads(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> ast.literal_eval(TEXT)
['kiarix moreno', '116224357500406255237', 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 1409770337, 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'es']
>>> re.search(r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]', TEXT).groupdict()
{'website': 'com.youtube.www/watch?v=p1JPKLa-Ofc:https', 'number2': '1409770337', 'language': 'es', 'data': 'z120gbkosz2oc3ckv23bc10hhwrudlcjy04', 'number1': '116224357500406255237', 'name': 'kiarix moreno'}
>>> 

In particular, your regular expression would be the following: r'\["(?P<name>[^"]*)","(?P<number1>[^"]*)","(?P<data>[^"]*)",(?P<number2>\d*),"(?P<website>[^"]*)","(?P<language>[^"]*)"\]'
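
Since the question mentions that the file also holds other HTML, here is a minimal sketch of scanning a larger text for every record with that shape and decoding each match with json.loads. The names RECORD, extract_records and sample are made up for illustration, and the pattern assumes no field contains an escaped double quote.

```python
import json
import re

# Same field layout as the pattern above: three quoted fields, one bare
# integer, then two more quoted fields.
RECORD = re.compile(
    r'\["[^"]*","[^"]*","[^"]*",\d+,"[^"]*","[^"]*"\]'
)

def extract_records(text):
    """Return each matching record as a Python list, decoded by json.loads."""
    return [json.loads(m.group(0)) for m in RECORD.finditer(text)]

# Hypothetical page content: the record is surrounded by unrelated HTML.
sample = (
    '<div class="c">[not a record]</div>\n'
    '<script>var x = ["kiarix moreno","116224357500406255237",'
    '"z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,'
    '"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"];</script>\n'
)

records = extract_records(sample)
for name, user_id, data, timestamp, url, language in records:
    print(name, timestamp, url)
```

Letting json.loads decode the matched text means the \u003d escape comes back as = and the 4th field comes back as an int, without extra conversion code.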


0
"([^"]*)"|(\d+)

You can try this. Grab the matches. See the demo:

http://regex101.com/r/dK1xR4/5
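
A minimal sketch of using this kind of alternation from Python, assuming the record line has already been isolated from the rest of the file (otherwise the pattern also picks up quoted HTML attributes and stray numbers). It keeps the closing quote outside the capture group so the fields come back without it. FIELD, line and fields are illustrative names.

```python
import re

# One alternative per field type: a double-quoted string or a run of digits.
FIELD = re.compile(r'"([^"]*)"|(\d+)')

line = (
    '["kiarix moreno","116224357500406255237",'
    '"z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,'
    '"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]'
)

fields = []
for m in FIELD.finditer(line):
    if m.group(2) is not None:
        # The bare-number alternative matched: convert it to an int.
        fields.append(int(m.group(2)))
    else:
        # The quoted-string alternative matched: keep the text between quotes.
        fields.append(m.group(1))

print(fields)
```

Note that the \u003d escape is left as literal text here, because a regex only slices the string and does not decode it.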

1 Comment

It is, but the file is composed of other text as well.
0

You can: 1) open the file; 2) read it line by line (iterate over the file object or call readline()); 3) call split(",") on each line and then use the resulting list however you want.
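
The steps above could be sketched roughly as follows. This is only an illustration: split_fields and fake_file are made-up names, io.StringIO stands in for the real file, and splitting on "," breaks as soon as a field itself contains a comma, so treat it as a quick hack rather than a parser.

```python
import io

def split_fields(line):
    """Split one record line on commas and strip the brackets and quotes."""
    inner = line.strip().lstrip('[').rstrip(']')
    return [part.strip('"') for part in inner.split(',')]

# io.StringIO stands in for open('yourfile.txt'); the file name is hypothetical.
fake_file = io.StringIO(
    '<p>some html</p>\n'
    '["kiarix moreno","116224357500406255237",'
    '"z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,'
    '"com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]\n'
)

# Only lines that start like a record are split; everything else is skipped.
rows = [split_fields(line) for line in fake_file if line.startswith('["')]
print(rows)
```

Every field comes back as a string with this approach, including the 4th one, so call int() on it if you need a number.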


0

I'm going to combine re, try/except, ast.literal_eval and reading the whole file at once to pick up every candidate list, including any [ ] that spans several lines, where readline wouldn't work.

Here is my solution:

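import re
import ast

# read the whole file so that a list may span several lines
with open('yourfile.txt', 'r') as f:
    text = f.read()

# grab all possible lists in the file; re.DOTALL lets '.' match newlines
found = re.findall(r'\[.*?\]', text, re.DOTALL)

for each in found:
    try:
        for el in ast.literal_eval(each):
            print(el)
    except (SyntaxError, ValueError):
        # not a valid Python literal (e.g. bracketed HTML or text), skip it
        pass

Output under Python 2 (Python 3 decodes the \u003d escape, so the URL line shows = instead):

kiarix moreno
116224357500406255237
z120gbkosz2oc3ckv23bc10hhwrudlcjy04
1409770337
com.youtube.www/watch?v\u003dp1JPKLa-Ofc:https
es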
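
If the embedded lists are valid JSON, as the comments on the question suggest, a variant that avoids guessing where a list ends is to let the json module find the closing bracket. The sketch below is one way to do that with json.JSONDecoder.raw_decode; iter_json_arrays and sample are hypothetical names, and the sample text is invented.

```python
import json

def iter_json_arrays(text):
    """Yield every JSON array embedded in text, skipping brackets that do not parse."""
    decoder = json.JSONDecoder()
    pos = text.find('[')
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and returns
            # the value together with the index just past its end.
            value, end = decoder.raw_decode(text, pos)
        except ValueError:
            # Not valid JSON at this bracket: try the next '[' instead.
            pos = text.find('[', pos + 1)
            continue
        yield value
        pos = text.find('[', end)

# Invented sample: a bracket that is not JSON, then a list split over lines.
sample = (
    '<b>[note]</b>\n'
    '["kiarix moreno","116224357500406255237",\n'
    ' "z120gbkosz2oc3ckv23bc10hhwrudlcjy04",1409770337,\n'
    ' "com.youtube.www/watch?v\\u003dp1JPKLa-Ofc:https","es"]\n'
)

for array in iter_json_arrays(sample):
    for el in array:
        print(el)
```

Because the JSON parser tracks string boundaries itself, a ] inside a quoted field does not cut the list short, and lists that span several lines need no special flag.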

