Python analyzing text and extracting with regex

Question

It would be great if you could help a Python beginner. Thanks for reading!

I want to analyze a text document that is formatted like this and contains a large number of lines of this form:

000001  A040C015_130223_R1WV             V     C        11:37:48:22 11:38:29:18 10:00:53:00 10:01:33:20

The fields are separated by runs of whitespace. So I did the following:

import re

# writing data into a list
datalist = []
filedata = open(inputfile, 'r')
for line in filedata:
    line = line.strip('\n\t\r')
    datalist.append(line)

filedata.close()

# splitting up the list by whitespace and creating a new list
newList = []
for i in datalist:
    newList.append(i.split(' '))


print newList[0:]

# parsing the list with regex
regCompiled = re.compile('^[A-Z][0-9]{3,3}[C][0-9]{3,3}[_][0-9]{6,6}[_][A-Z][0-9]{2,2}[A-Z].*')

for content in newList:
    checkMatch = re.match(regCompiled, content)
    if checkMatch:
        print ("Found:"), content
    else:
        print ("NO Match")

The first problem is that, after splitting, each line becomes a list with an empty string ('') for every extra space, and because of the split function the result is a list inside a list.

I tried

filter(None, newList)

but the '' items remain, and I get an error from the regex because of the empty items. In the end I only want to extract the strings that look like A040C015_etc.

The full text list is here: Link to full Text Document

Thank you very much for any help... rainer

It should find exactly this format of string: A040C015_130223_R1WV
– rainer, Commented Feb 20, 2014 at 11:36

WeaselFox · Accepted Answer · 2014-02-20 11:30:37Z

1

Try using split() instead of split(" "). With no argument, split() splits on any run of whitespace and drops empty strings, so that should take care of the extra spaces:

>>> i = "x  X"
>>> i.split()
['x', 'X']
>>> i.split(" ")
['x', '', 'X']
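
Applied to the sample line from the question, the same call should give one clean field per column. This is a minimal sketch in Python 3 syntax (the question itself uses Python 2 print statements); the variable names `line` and `fields` are only illustrative:

```python
# Sample line copied from the question; columns are separated by runs of spaces.
line = "000001  A040C015_130223_R1WV             V     C        11:37:48:22 11:38:29:18 10:00:53:00 10:01:33:20"

# split() with no argument splits on any run of whitespace and drops empty strings.
fields = line.split()
print(fields)

# split(' ') splits on every single space, so consecutive spaces produce '' items.
print('' in line.split(' '))
```

The second field, `fields[1]`, is the A040C015_130223_R1WV string the question is after.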

5 Comments

Esenti Over a year ago

Link to documentation explaining this behaviour.

rainer Over a year ago

The list output is now: ['000001', 'A040C015_130223_R1WV', 'V', 'C', '11:37:48:22', '11:38:29:18', '10:00:53:00', '10:01:33:20'], ['000002', 'A038C015_130223_R1WV', 'V', 'C', '05:19:31:20', '05:20:19:07', '10:01:33:20', '10:02:21:07'] and so on. The list-in-a-list problem remains...

WeaselFox Over a year ago

what is it you want, exactly? a single list with all the split fields?

rainer Over a year ago

In the end I want a list containing only the strings in the A040C015_130223_R1WV format. The fields with numbers and single characters are not needed.

WeaselFox Over a year ago

Well, to get one flat list instead of your list of lists, replace newList.append(i.split(' ')) with newList.extend(i.split())
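
Putting the answer and the comments together, one possible way to end up with only the clip names is sketched below. It is an illustration in Python 3 syntax, not the asker's final code. The names `CLIP_NAME`, `extract_clip_names` and `sample` are made up for the example, and the spacing of the second sample line is assumed. The pattern is also an assumption: the one in the question expects two digits after the letter in the last group, which would not match R1WV, so this sketch uses one letter, one digit and two letters there, based on the two names shown in this thread.

```python
import re

# Assumed shape of a clip name, based on the two samples in this thread:
# letter, 3 digits, 'C', 3 digits, '_', 6 digits, '_', letter, digit, 2 letters.
CLIP_NAME = re.compile(r'[A-Z][0-9]{3}C[0-9]{3}_[0-9]{6}_[A-Z][0-9][A-Z]{2}')


def extract_clip_names(lines):
    """Return every whitespace-separated field that is exactly a clip name."""
    fields = []
    for line in lines:
        # extend() adds the fields themselves, so the result stays one flat list.
        fields.extend(line.split())
    # fullmatch() requires the whole field to fit the pattern.
    return [field for field in fields if CLIP_NAME.fullmatch(field)]


# Two lines in the format from the question, standing in for the real file.
sample = [
    "000001  A040C015_130223_R1WV             V     C        11:37:48:22 11:38:29:18 10:00:53:00 10:01:33:20",
    "000002  A038C015_130223_R1WV             V     C        05:19:31:20 05:20:19:07 10:01:33:20 10:02:21:07",
]
print(extract_clip_names(sample))
```

Because iterating over an open file yields its lines, the same function should also accept a file object, for example inside `with open(inputfile) as filedata:`. Since split() never produces empty strings, no filter(None, ...) step is needed.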
