3

Preface - I'm pretty new to Python, having had more experience in another language.

I have a text file containing a single-column list of strings in the generic (but slightly varying) format "./abc123a1/type/1ab2_x_data_type.file.type"

I need to extract the abc123a1 and the 1ab2 portions from each of the several hundred rows and put them under two columns (column a and column b) in a CSV. Sometimes there may be both a "1ab2_a" and a "1ab2_b", but I only want one 1ab2, so I'd want to grab "1ab2_a" and ignore all others.

I have the regex which I THINK will work:

tmp = list()
if re.findall(re.compile(r'^([a-zA-Z0-9]{4})_'), x):
    tmp = re.findall(re.compile(r'^([a-zA-Z0-9]{4})_'), x)
elif re.findall(re.compile(r'_([a-zA-Z0-9]{4})_'), x):
    tmp = re.findall(re.compile(r'_([a-zA-Z0-9]{4})_'), x)
if len(tmp) == 0:
    return None
elif len(tmp) > 1:
    print "ERROR found multiple matches"
    return "ERROR"
else:
    return tmp[0].upper()

I am trying to build this script step by step, testing as I go to make sure it works, but it just isn't working.

import sys
import csv

listOfData = []

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])
print listOfData

with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)

print listOfData

Still failing to get anything in the csv other than column headers, much less a parsed version!

Does anyone have any better ideas or formats I could do this in? A friend mentioned looking into glob.glob, but I haven't had luck getting that to work either.

3
  • When you print listOfData, does it have the data that you want? Commented Aug 21, 2015 at 15:16
  • "So I'd want to grab "1ab2_a" and ignore all others." I'm not sure I understand this sentence. Do you want to extract 1ab2 or 1ab2_a? Commented Aug 21, 2015 at 15:18
  • Could you edit the question to add some more example input lines? Also add what the expected output for that input would be. Commented Aug 21, 2015 at 15:21

4 Answers

2

IMHO, you were not far from making it work. The problem is that you read through the whole file once just to print the lines, and then, already at the end of the file, you try to put them into a list... and get an empty list!

You should read the file only once:

import sys
import csv

listOfData = []

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])
print listOfData

with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)

print listOfData

Once it works, you still have to apply the regex to extract the relevant data to put into the CSV file.
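As a rough sketch of that remaining step, the extraction could look something like the following. It is written in Python 3 syntax rather than the Python 2 used above, the names `NAME_RE`, `extract_pair` and `write_pairs` are made up for illustration, and it assumes every line has the `./<folder>/<type>/<4-character code>_...` shape shown in the question. The pattern is the question's first regex, applied to the file name only.

```python
import csv
import io
import os
import re

# The question's first pattern: four alphanumerics at the start of the
# file name, followed by an underscore.
NAME_RE = re.compile(r'^([a-zA-Z0-9]{4})_')


def extract_pair(line):
    """Return (folder, code) for one path line, or None if it does not fit."""
    path = line.strip()
    parts = path.split('/')
    if len(parts) < 3:
        return None
    match = NAME_RE.match(os.path.basename(path))
    if match is None:
        return None
    # parts[0] is '.', so parts[1] is the first folder name.
    return parts[1], match.group(1)


def write_pairs(lines, out_file):
    """Write a header plus one CSV row per line that could be parsed."""
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    for line in lines:
        pair = extract_pair(line)
        if pair is not None:
            writer.writerow(pair)


# In-memory demonstration using the single sample line from the question.
sample = ['./abc123a1/type/1ab2_x_data_type.file.type\n']
buffer = io.StringIO()
write_pairs(sample, buffer)
print(buffer.getvalue())
```

With a real file, `write_pairs` could be handed the open input file and an output file opened with `open('extracted.csv', 'w', newline='')`. Lines that do not fit the assumed shape are skipped silently here; whether that is the right behaviour depends on the data.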


0

I am not sure about your regex (it will most probably not work), but the reason your current simple, non-regex code does not work is this:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])

As you can see, you first iterate over each line in the file and print it, which is fine. But after that loop ends, the file pointer is at the end of the file, so iterating over it again produces no results. You should iterate over the file only once and do both the printing and the appending to the list inside that loop. For example:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])


0

I think at least part of the problem is the two for loops in the following:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])

The first one prints all the lines of f, so there's nothing left for the second one to iterate over unless you first call f.seek(0) to rewind the file.
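The exhaustion and the rewind can be seen in a small sketch. It uses an in-memory `io.StringIO` object with made-up contents as a stand-in for the real file, and Python 3 syntax; a file opened with `open()` behaves the same way when iterated.

```python
import io

# An in-memory stand-in for the input file (hypothetical contents).
f = io.StringIO("first line\nsecond line\n")

first_pass = [line for line in f]    # consumes the whole file
second_pass = [line for line in f]   # nothing left: the position is at the end

f.seek(0)                            # rewind to the start
third_pass = [line for line in f]    # all lines are available again

print(len(first_pass), len(second_pass), len(third_pass))
```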

An alternative would be to simply do this:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])

It's hard to tell if your regexes are OK without more than one line of sample input data.
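For the one sample line that is available, a quick check along these lines suggests the patterns only do what is wanted when applied to the file name rather than the whole path. This is a sketch in Python 3 syntax; `START_RE` and `MIDDLE_RE` are just illustrative names for the two patterns from the question.

```python
import os
import re

START_RE = re.compile(r'^([a-zA-Z0-9]{4})_')
MIDDLE_RE = re.compile(r'_([a-zA-Z0-9]{4})_')

line = './abc123a1/type/1ab2_x_data_type.file.type'
name = os.path.basename(line)  # '1ab2_x_data_type.file.type'

# On the whole line, '^' anchors at './', so the first pattern finds nothing.
on_line_start = START_RE.findall(line)    # []
# The second pattern then picks up the unrelated '_data_' segment.
on_line_middle = MIDDLE_RE.findall(line)  # ['data']
# On the file name alone, the first pattern finds the wanted code.
on_name_start = START_RE.findall(name)    # ['1ab2']

print(on_line_start, on_line_middle, on_name_start)
```

So if `x` in the question's snippet is the full line, the `elif` branch would return "DATA" for this sample; passing only the file name avoids that.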


0

Are you sure you need all of the regular expressions? You seem to be parsing a list of paths and filenames. The path could be split up using a split command, for example:

print "./abc123a1/type/1ab2_a_data_type.file.type".split("/")

Would give:

['.', 'abc123a1', 'type', '1ab2_a_data_type.file.type']

You could then build a tuple from the second entry and the part of the fourth entry that comes before the first '_', e.g.

('abc123a1', '1ab2')

Storing those tuples in a set lets you write only the first occurrence of each pair:

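import sys
import csv

# Pairs that have already been written to the CSV
pairs = set()

# Python 2: the csv module expects the output file in binary mode ('wb')
with open(sys.argv[1], 'r') as in_file, open('extracted.csv', 'wb') as out_file:
    writer = csv.writer(out_file)

    for row in in_file:
        folders = row.split("/")
        col_a = folders[1]                # e.g. 'abc123a1'
        col_b = folders[3].split("_")[0]  # e.g. '1ab2'

        # Write each (col_a, col_b) combination only once
        if (col_a, col_b) not in pairs:
            pairs.add((col_a, col_b))
            writer.writerow([col_a, col_b])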

So for an input looking like this:

./abc123a1/type/1ab2_a_data_type.file.type
./abc123a1/type/1ab2_b_data_type.file.type
./abc123a2/type/1ab2_a_data_type.file.type
./abc123a3/type/1ab2_a_data_type.file.type

You would get a CSV file looking like:

abc123a1,1ab2
abc123a2,1ab2
abc123a3,1ab2
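For anyone on Python 3, a sketch of the same idea is below. The function name `unique_pairs` is made up for illustration, the output goes to an in-memory buffer so the example is self-contained, and skipping lines with fewer than four path parts is an added assumption, not something the question specifies.

```python
import csv
import io


def unique_pairs(lines):
    """Yield each (folder, code) pair once, in first-seen order."""
    seen = set()
    for line in lines:
        parts = line.strip().split('/')
        if len(parts) < 4:
            continue  # skip blank or unexpected lines (an added assumption)
        pair = (parts[1], parts[3].split('_')[0])
        if pair not in seen:
            seen.add(pair)
            yield pair


lines = [
    './abc123a1/type/1ab2_a_data_type.file.type\n',
    './abc123a1/type/1ab2_b_data_type.file.type\n',
    './abc123a2/type/1ab2_a_data_type.file.type\n',
    './abc123a3/type/1ab2_a_data_type.file.type\n',
]

out_file = io.StringIO()
csv.writer(out_file).writerows(unique_pairs(lines))
print(out_file.getvalue())
```

When writing to a real file in Python 3, the output would be opened in text mode with `open('extracted.csv', 'w', newline='')` rather than `'wb'`.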
