Parsing a text file in Python and outputting to a CSV

Question

Preface - I'm pretty new to Python, having had more experience in another language.

I have a text file containing a single-column list of strings in the generic (but slightly varying) format "./abc123a1/type/1ab2_x_data_type.file.type".

I need to extract the abc123a1 and 1ab2 portions from each of the several hundred rows and put them in two columns (column a and column b) of a CSV. Sometimes there may be both a "1ab2_a" and a "1ab2_b", but I only want one 1ab2, so I'd grab "1ab2_a" and ignore all the others.

I have the regex which I THINK will work:

import re

# Only the function body appeared in the post; the name extract_id is a placeholder.
def extract_id(x):
    tmp = list()
    if re.findall(re.compile(r'^([a-zA-Z0-9]{4})_'), x):
        tmp = re.findall(re.compile(r'^([a-zA-Z0-9]{4})_'), x)
    elif re.findall(re.compile(r'_([a-zA-Z0-9]{4})_'), x):
        tmp = re.findall(re.compile(r'_([a-zA-Z0-9]{4})_'), x)
    if len(tmp) == 0:
        return None
    elif len(tmp) > 1:
        print "ERROR found multiple matches"
        return "ERROR"
    else:
        return tmp[0].upper()

I am trying to build this script step by step, testing as I go to make sure each part works, but it just doesn't.

import sys
import csv

listOfData = []

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])
print listOfData

with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)

print listOfData

Still failing to get anything in the csv other than column headers, much less a parsed version!

Does anyone have any better ideas or formats I could do this in? A friend mentioned looking into glob.glob, but I haven't had luck getting that to work either.

When you print listOfData, does it have the data that you want?
– Joseph Stover, Commented Aug 21, 2015 at 15:16
"So I'd want to grab "1ab2_a" and ignore all others." I'm not sure I understand this sentence. Do you want to extract 1ab2 or 1ab2_a?
– Casimir et Hippolyte, Commented Aug 21, 2015 at 15:18
Could you edit the question to add some more example input lines? Also add what the expected output for that input would be.
– Martin Evans, Commented Aug 21, 2015 at 15:21

Serge Ballesta · Accepted Answer · 2015-08-21 15:24:31Z


IMHO, you were not far from making it work. The problem is that you first read the whole file just to print its lines, and then, once you are already at the end of the file, you try to put the lines into a list... and get an empty list!

You should read the file only once:

import sys
import csv

listOfData = []

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])
print listOfData

with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)

print listOfData

Once it works, you still have to use the regex to extract the relevant data to put into the CSV file.
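For that remaining step, here is a minimal sketch in Python 3 syntax (the code above is Python 2). It assumes every line looks like the sample in the question, ./abc123a1/type/1ab2_x_data_type.file.type. The pattern LINE_RE and the helpers extract_pair and write_pairs are illustrative names, not taken from the question, and the pattern is a single replacement for the two regexes shown there.

```python
import csv
import io
import re

# Assumed layout: ./<column a>/<any folders>/<4 alphanumerics>_<rest of file name>
LINE_RE = re.compile(r'^\./([^/]+)/(?:[^/]+/)*([A-Za-z0-9]{4})_[^/]*$')


def extract_pair(line):
    """Return (column a, column b) for one line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def write_pairs(lines, out_file):
    """Write a header, then one CSV row per distinct pair in first-seen order."""
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    seen = set()
    for line in lines:
        pair = extract_pair(line)
        if pair is not None and pair not in seen:
            seen.add(pair)
            writer.writerow(pair)


if __name__ == '__main__':
    sample = [
        './abc123a1/type/1ab2_a_data_type.file.type\n',
        './abc123a1/type/1ab2_b_data_type.file.type\n',
    ]
    buffer = io.StringIO()  # in-memory stand-in for the output file
    write_pairs(sample, buffer)
    print(buffer.getvalue())
```

With a real output file on Python 3, open('extracted.csv', 'w', newline='') would take the place of the in-memory buffer.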


Anand S Kumar · Accepted Answer · 2015-08-21 15:22:35Z

I am not sure about your regex (it will most probably not work), but the reason your current (non-regex, simple) code does not work is this:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])

As you can see, you first iterate over each line in the file and print it, which is fine. But once that loop ends, the file pointer is at the end of the file, so iterating over the file again produces nothing. You should iterate over it only once and do both the printing and the appending to the list inside that loop. Example:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])
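The exhaustion is easy to reproduce without the original input. A small sketch in Python 3 syntax, using an in-memory io.StringIO as a stand-in for the file opened from sys.argv[1] (the two sample lines are made up):

```python
import io

# Stand-in for the real file; the content is invented sample data.
f = io.StringIO('line one\nline two\n')

first_pass = [line for line in f]   # consumes the whole stream
second_pass = [line for line in f]  # the position is already at the end

print(first_pass)   # ['line one\n', 'line two\n']
print(second_pass)  # []
```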

martineau · Accepted Answer · 2015-08-21 15:42:29Z


I think at least part of the problem is the two for loops in the following:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])

The first one prints all the lines of f, so there's nothing left for the second one to iterate over unless you first rewind the file with f.seek(0).
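A sketch of the rewind approach in Python 3 syntax, keeping the two loops. A temporary file with two made-up lines stands in for the file named by sys.argv[1], and list_of_data plays the role of listOfData:

```python
import tempfile

list_of_data = []

# A temporary file stands in for the real input file.
with tempfile.TemporaryFile('w+') as f:
    f.write('./abc123a1/type/1ab2_a_data_type.file.type\n')
    f.write('./abc123a2/type/1ab2_a_data_type.file.type\n')
    f.seek(0)  # back to the start after writing the sample data

    for line in f:  # first pass: print only
        print(line, end='')

    f.seek(0)  # rewind so the second loop has something to read

    for line in f:  # second pass: collect each line as a one-item row
        list_of_data.append([line.rstrip('\n')])

print(list_of_data)
```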

An alternative would be to simply do this:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])

It's hard to tell if your regexes are OK without more than one line of sample input data.
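The one sample line is still enough for a quick check. The sketch below (Python 3 syntax) wraps the two patterns from the question in a hypothetical helper, find_id, and applies it to the sample line twice: once as the whole line and once as just the file name. The question does not say which of the two its x holds, so both cases are assumptions.

```python
import re

# The two patterns from the question, compiled once.
START_RE = re.compile(r'^([a-zA-Z0-9]{4})_')
MIDDLE_RE = re.compile(r'_([a-zA-Z0-9]{4})_')


def find_id(x):
    """Mirror the question's logic: try the anchored pattern first, then the other."""
    return START_RE.findall(x) or MIDDLE_RE.findall(x)


line = './abc123a1/type/1ab2_x_data_type.file.type'
file_name = line.split('/')[-1]

print(find_id(line))       # ['data']
print(find_id(file_name))  # ['1ab2']
```

Applied to the whole line, the first pattern cannot match because the line starts with "./", and the second pattern picks up "data" rather than "1ab2". Applied to the file name alone, the first pattern returns the expected value.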

edited Aug 21, 2015 at 15:42

answered Aug 21, 2015 at 15:23


Martin Evans · Accepted Answer · 2015-08-21 15:56:27Z

Are you sure you need all of the regular expressions? You seem to be parsing a list of paths and filenames. The path could be split up using a split command, for example:

print "./abc123a1/type/1ab2_a_data_type.file.type".split("/")

Would give:

['.', 'abc123a1', 'type', '1ab2_a_data_type.file.type']

You could then build a tuple from the second entry and the part of the fourth entry before the first '_', e.g.

('abc123a1', '1ab2')

Storing these tuples in a set then lets you write only the first occurrence of each pair:

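import sys
import csv

pairs = set()  # (col_a, col_b) combinations already written

# 'wb' is the Python 2 mode for csv output; on Python 3 use open('extracted.csv', 'w', newline='')
with open(sys.argv[1], 'r') as in_file, open('extracted.csv', 'wb') as out_file:
    writer = csv.writer(out_file)

    for row in in_file:
        folders = row.split("/")
        col_a = folders[1]                # e.g. 'abc123a1'
        col_b = folders[3].split("_")[0]  # e.g. '1ab2'

        # Write only the first occurrence of each pair
        if (col_a, col_b) not in pairs:
            pairs.add((col_a, col_b))
            writer.writerow([col_a, col_b])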

So for an input looking like this:

./abc123a1/type/1ab2_a_data_type.file.type
./abc123a1/type/1ab2_b_data_type.file.type
./abc123a2/type/1ab2_a_data_type.file.type
./abc123a3/type/1ab2_a_data_type.file.type

You would get a CSV file looking like:

abc123a1,1ab2
abc123a2,1ab2
abc123a3,1ab2
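The same loop can be checked against that sample without touching the disk. The following is a sketch in Python 3 syntax, where write_unique_pairs is an illustrative name and io.StringIO objects stand in for the input and output files. It also skips lines with fewer than four path components, which is an added assumption about how unexpected lines should be handled.

```python
import csv
import io


def write_unique_pairs(in_file, out_file):
    """Write the first (col_a, col_b) pair seen for each combination."""
    pairs = set()
    writer = csv.writer(out_file)
    for row in in_file:
        folders = row.strip().split('/')
        if len(folders) < 4:
            continue  # assumption: silently skip blank or unexpected lines
        col_a = folders[1]
        col_b = folders[3].split('_')[0]
        if (col_a, col_b) not in pairs:
            pairs.add((col_a, col_b))
            writer.writerow([col_a, col_b])


sample = io.StringIO(
    './abc123a1/type/1ab2_a_data_type.file.type\n'
    './abc123a1/type/1ab2_b_data_type.file.type\n'
    './abc123a2/type/1ab2_a_data_type.file.type\n'
    './abc123a3/type/1ab2_a_data_type.file.type\n'
)
result = io.StringIO()
write_unique_pairs(sample, result)
print(result.getvalue())
```

With real files on Python 3, the output would be opened with open('extracted.csv', 'w', newline='') rather than 'wb', because the csv module writes text there, not bytes.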
