It would be great if you could help a Python beginner. Thanks for reading!

I want to analyze a text document that has a large number of lines formatted like this:

000001  A040C015_130223_R1WV             V     C        11:37:48:22 11:38:29:18 10:00:53:00 10:01:33:20

The fields are separated by runs of spaces. So I did the following:

#writing data into list
datalist = []
filedata = open(inputfile, 'r')
for line in filedata:
    line = line.strip('\n\t\r')
    datalist.append(line)

filedata.close()

#splitting up List by whitespace and creating new List
newList = []
for i in datalist:
    newList.append(i.split(' '))


print newList[0:]

#parsing list with regex
regCompiled = re.compile('^[A-Z][0-9]{3,3}[C][0-9]{3,3}[_][0-9]{6,6}[_][A-Z][0-9]{2,2}[A-Z].*');

for content in newList:
    checkMatch = re.match(regCompiled, content);    
    if checkMatch:
        print ("Found:"), content
    else:
        print ("NO Match")

The first problem is that, after splitting, every line becomes a list with an empty ('') item for every extra space. On top of that, because split returns a list for each line, I end up with a list of lists.

I tried

filter(None, newList)

but the ('') items remain, and I get an error from the regex because of the empty items. In the end I want to extract only the strings in the A040C015_etc. format.

The full textlist is here: Link to full Text Document

Thank you very much for any help... rainer

  • What is that regex supposed to find exactly? Commented Feb 20, 2014 at 11:31
  • It should find exactly this format of string: A040C015_130223_R1WV Commented Feb 20, 2014 at 11:36

1 Answer

Try using split() instead of split(" "). With no argument, split() splits on any run of whitespace and never returns empty strings, so that should take care of the extra spaces:

>>> i = "x  X"
>>> i.split()
['x', 'X']
>>> i.split(" ")
['x', '', 'X']
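
To see the same effect on real data, here is a minimal sketch that uses the sample row from the question. The variable names `sample`, `with_empties`, and `fields` are only for illustration.

```python
# Sample row copied from the question; runs of spaces separate the fields.
sample = ("000001  A040C015_130223_R1WV             V     C        "
          "11:37:48:22 11:38:29:18 10:00:53:00 10:01:33:20")

# split(' ') cuts at every single space, so consecutive spaces yield '' items.
with_empties = sample.split(' ')

# split() with no argument cuts on any run of whitespace and never yields ''.
fields = sample.split()

print(fields[1])            # A040C015_130223_R1WV
print('' in with_empties)   # True
print('' in fields)         # False
```

With the empty items gone, the second field of each row is the clip name, at least for rows shaped like this sample.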

5 Comments

Link to documentation explaining this behaviour.
The list output is now: ['000001', 'A040C015_130223_R1WV', 'V', 'C', '11:37:48:22', '11:38:29:18', '10:00:53:00', '10:01:33:20'], ['000002', 'A038C015_130223_R1WV', 'V', 'C', '05:19:31:20', '05:20:19:07', '10:01:33:20', '10:02:21:07'] and so on. The list-in-a-list problem remains...
What is it you want, exactly? A single list with all the split fields?
In the end I want a list with only the strings in the A040C015_130223_R1WV format. The fields with numbers and single characters are not needed.
Well, to get one flat list from your list of lists, replace newList.append(i.split(' ')) with newList.extend(i.split()).
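
Pulling the comments together, one possible way to end up with only the clip names is sketched below. Two things are assumptions here. First, the names `CLIP_PATTERN` and `extract_clip_names` are made up for this example. Second, the pattern is inferred from the two names quoted above: the regex in the question expects two digits after the last underscore, which R1WV does not have, so this sketch accepts any four uppercase letters or digits there instead. The sketch also checks each field on its own, since re.match needs a string and raises a TypeError when handed a list.

```python
import re

# Assumed format, inferred from the two sample names: letter, 3 digits, 'C',
# 3 digits, '_', 6 digits, '_', then 4 uppercase letters or digits.
CLIP_PATTERN = re.compile(r'^[A-Z][0-9]{3}C[0-9]{3}_[0-9]{6}_[A-Z0-9]{4}$')

def extract_clip_names(lines):
    """Return every whitespace-separated field that looks like a clip name."""
    names = []
    for line in lines:
        # Test each field (a string) separately instead of the whole list.
        for field in line.split():
            if CLIP_PATTERN.match(field):
                names.append(field)
    return names

# Two rows rebuilt from the output quoted in the comments above.
lines = [
    "000001  A040C015_130223_R1WV   V   C   11:37:48:22 11:38:29:18 10:00:53:00 10:01:33:20",
    "000002  A038C015_130223_R1WV   V   C   05:19:31:20 05:20:19:07 10:01:33:20 10:02:21:07",
]
print(extract_clip_names(lines))
```

To run this on the real file, the lines could be read with `open(inputfile)` as in the question. The pattern may need adjusting if other rows use a different naming scheme.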
