Text File Parsing with Python

Question

I am trying to parse a series of text files and save them as CSV files using Python (2.7.3). All text files have a four-line header that needs to be stripped out. The data lines use various delimiters, including " (quote), - (dash), : (colon), and blank space. I found it a pain to code this in C++ with all these different delimiters, so I decided to try it in Python, having heard it is relatively easier than in C/C++.

I wrote a piece of code to test it on a single line of data and it works; however, I could not make it work for the actual file. For parsing a single line I was using a string object and its replace method. It looks like my current implementation reads the text file as a list, and list objects have no replace method.

Being a novice in Python, I got stuck at this point. Any input would be appreciated!

Thanks!

# function for parsing the data
def data_parser(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i,j)
    return text

# open input/output files

inputfile = open('test.dat')
outputfile = open('test.csv', 'w')

my_text = inputfile.readlines()[4:] # reads the whole text file, skipping the first 4 lines


# sample text string, just for demonstration, to show what the data looks like
# my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636'

# dictionary definition: 0-, 1- etc. are there to parse the date block delimited with dashes, and make sure the negative numbers are not affected
reps = {'"NAN"':'NAN', '"':'', '0-':'0,','1-':'1,','2-':'2,','3-':'3,','4-':'4,','5-':'5,','6-':'6,','7-':'7,','8-':'8,','9-':'9,', ' ':',', ':':',' }

txt = data_parser(my_text, reps)
outputfile.writelines(txt)

inputfile.close()
outputfile.close()

You should attach a copy of the file you need to parse and the expected output; that way it will be easier to help you.
– Diego Allen, Commented Aug 13, 2012 at 15:16

Joe Day · Accepted Answer · answered 2012-08-13 15:03, edited 2017-04-26 11:36:01Z by Christopher Bottoms

19

I would use a for loop to iterate over the lines in the text file:

for line in my_text:
    outputfile.write(data_parser(line, reps))

If you want to read the file line-by-line instead of loading the whole thing at the start of the script you could do something like this:

inputfile = open('test.dat')
outputfile = open('test.csv', 'w')

# sample text string, just for demonstration, to show what the data looks like
# my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636'

# dictionary definition: 0-, 1- etc. are there to parse the date block delimited with dashes, and make sure the negative numbers are not affected
reps = {'"NAN"':'NAN', '"':'', '0-':'0,','1-':'1,','2-':'2,','3-':'3,','4-':'4,','5-':'5,','6-':'6,','7-':'7,','8-':'8,','9-':'9,', ' ':',', ':':',' }

for i in range(4): inputfile.next() # skip first four lines
for line in inputfile:
    outputfile.write(data_parser(line, reps))

inputfile.close()
outputfile.close()
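For readers on a newer Python, the same line-by-line idea can be sketched with `with` blocks and `itertools.islice` to skip the header. This is only a sketch under assumptions: the helper name `convert_file` and its `header_lines` parameter are made up for illustration, `dict.items()` stands in for `iteritems()` so the code also runs on Python 3, and the digit-dash entries of `reps` are generated in a loop rather than typed out.

```python
from itertools import islice


def data_parser(text, dic):
    """Apply every replacement in dic to text and return the result."""
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


def convert_file(in_path, out_path, reps, header_lines=4):
    """Hypothetical helper: stream in_path to out_path one line at a time."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        # islice skips the header lines without loading the file into memory
        for line in islice(infile, header_lines, None):
            outfile.write(data_parser(line, reps))


reps = {'"NAN"': 'NAN', '"': '', ' ': ',', ':': ','}
# A digit followed by a dash only occurs in the date block,
# so negative numbers (which follow a comma) are left alone.
reps.update((str(d) + '-', str(d) + ',') for d in range(10))
```

Calling `convert_file('test.dat', 'test.csv', reps)` would then process the file without ever holding more than one line in memory, which matters for the 500 MB inputs mentioned in the comments.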

edited Apr 26, 2017 at 11:36

Christopher Bottoms

answered Aug 13, 2012 at 15:03

Joe Day

5 Comments

marillion Over a year ago

Thanks! What would be the best way to skip the first 4 lines then? I admit I could not find a way to do it, which is why I decided to read the whole thing. Should I write the file except the first 4 lines to another file to run the loop you have above? I bet there should be an easier way though. EDIT: oh wait, I think you mean replacing the line txt = data_parser(my_text, reps) with the loop you have above.

Joe Day Over a year ago

You've already skipped the first 4 lines with the line my_text = inputfile.readlines()[4:]. If you would rather read the file line-by-line and not load the whole thing into RAM at the beginning of the script, I can update my answer.

marillion Over a year ago

Sorry, I got it wrong in the first place (see my EDIT above), but thanks, it works perfectly!!! Now, I would be very glad to learn about the "read line, parse, write line" (line-by-line) way of doing things. I have some large files, over 500 MB in size, which may mess things up. Could you update your answer with a second example?

Joe Day Over a year ago

I updated my answer with a version that reads the input file a line at a time.

marillion Over a year ago

Greatly appreciated, thank you! for i in range(4): inputfile.next() was what I was looking for before deciding to read the whole thing by the way!

DSM · Answer · 2012-08-13 15:24:40Z

11

From the accepted answer, it looks like your desired behaviour is to turn

skip 0
skip 1
skip 2
skip 3
"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636

into

2012,06,23,03,09,13.23,4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,NAN,-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636

If that's right, then I think something like

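import csv

# Python 2: the csv module expects files opened in binary mode
with open("test.dat", "rb") as infile, open("test.csv", "wb") as outfile:
    reader = csv.reader(infile)  # strips the quotes around "..." fields
    # False equals csv.QUOTE_MINIMAL (0): quote only fields that need it
    writer = csv.writer(outfile, quoting=False)
    for i, line in enumerate(reader):
        if i < 4: continue  # skip the four header lines
        date = line[0].split()      # ['2012-06-23', '03:09:13.23']
        day = date[0].split('-')    # ['2012', '06', '23']
        time = date[1].split(':')   # ['03', '09', '13.23']
        newline = day + time + line[1:]
        writer.writerow(newline)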

would be a little simpler than the reps stuff.
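As a rough Python 3 sketch of the same csv-based idea, the version below works on in-memory text so it can be run without any files. The helper names `split_timestamp` and `convert` and the `header_lines` parameter are assumptions made for illustration, and the sample row is shortened from the one in the question.

```python
import csv
import io


def split_timestamp(row):
    """Split the leading 'YYYY-MM-DD hh:mm:ss' field into six fields."""
    day, time = row[0].split()
    return day.split('-') + time.split(':') + row[1:]


def convert(infile, outfile, header_lines=4):
    """Hypothetical helper: copy rows to outfile, skipping the header."""
    reader = csv.reader(infile)   # the reader strips the double quotes
    writer = csv.writer(outfile)  # default quoting: only when a field needs it
    for i, row in enumerate(reader):
        if i < header_lines:
            continue
        writer.writerow(split_timestamp(row))


sample = io.StringIO(
    'skip 0\nskip 1\nskip 2\nskip 3\n'
    '"2012-06-23 03:09:13.23",4323584,-1.911224,"NAN",-0.586636\n'
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

With real files on Python 3, the csv documentation recommends opening both in text mode with `newline=''` instead of the `"rb"`/`"wb"` modes used on Python 2.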

answered Aug 13, 2012 at 15:24

DSM

3 Comments

marillion Over a year ago

I tried using the csv module before coming up with the reps bit, but found the documentation a little bit confusing. Your example makes it much clearer. I will try this, just for the sake of learning too. 1. Do you eliminate quotes in the text file by quoting=False? 2. Could you verify my understanding? The date line in the code splits the date portion first, which becomes a list by itself; day and time are split next, and the rest of the line is appended to the day and time. I am not sure how it automatically adds commas though, in your newline = day + time + line[1:] line. Hmm...

DSM Over a year ago

@marillion: (1) Yes, there are lots of different quote options. I think it's a little strange to get rid of them all, actually, but maybe you need that for some reason. (2) Yep. Commas aren't added in newline -- that's just a list. writerow is the writer method which adds commas to the output string (or tabs or any other delimiter we wanted) and would handle quoting if we wanted that.

marillion Over a year ago

Ok, I think I got it. Plus, you never needed to parse the data portion of the line at all, since it was already comma separated. Good information, thanks a lot!
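To make the point about writerow concrete, here is a small Python 3 sketch that writes the same list with different writer options. It uses an in-memory buffer instead of a file, and the helper name `render` is made up for illustration.

```python
import csv
import io


def render(row, **writer_options):
    """Hypothetical helper: return the text csv.writer produces for one row."""
    buffer = io.StringIO()
    csv.writer(buffer, **writer_options).writerow(row)
    return buffer.getvalue()


row = ['2012', '06', '23', 'NAN', '-0.586636']

default_line = render(row)                        # comma-separated, no quotes needed
tab_line = render(row, delimiter='\t')            # same list, tab-separated
quoted_line = render(row, quoting=csv.QUOTE_ALL)  # every field wrapped in quotes
```

The list itself never changes; only the writer's `delimiter` and `quoting` settings decide how it is serialized.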

Julian · Answer · 2012-08-13 15:11:08Z

There are a few ways to go about this. One option would be to use inputfile.read() instead of inputfile.readlines() - you'd need to write separate code to strip the first four lines, but if you want the final output as a single string anyway, this might make the most sense.

A second, simpler option would be to rejoin the strings after stripping the first four lines with my_text = ''.join(my_text). This is a little inefficient, but if speed isn't a major concern, the code will be simplest.

Finally, if you actually want the output as a list of strings instead of a single string, you can just modify your data parser to iterate over the list. That might look something like this:

def data_parser(lines, dic):
    for i, j in dic.iteritems():
        for (k, line) in enumerate(lines):
            lines[k] = line.replace(i, j)
    return lines
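The first two options can be sketched as follows, using an in-memory string in place of the file contents. This is a sketch under assumptions: the helper `strip_header` is hypothetical, `dict.items()` is used so the code runs on Python 3, and `reps` is a trimmed-down version of the dictionary in the question.

```python
def data_parser(text, dic):
    """Apply every replacement in dic to text and return the result."""
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


def strip_header(text, header_lines=4):
    """Hypothetical helper: drop the first header_lines lines of a string."""
    # Split at most header_lines times; the last piece is the untouched rest.
    parts = text.split('\n', header_lines)
    return parts[header_lines] if len(parts) > header_lines else ''


raw = ('h0\nh1\nh2\nh3\n'
       '"2012-06-23 03:09:13.23",1,"NAN"\n'
       '"2012-06-23 03:09:14.23",2,-0.5\n')

# Option 1: one big string, as inputfile.read() would return it
body = strip_header(raw)

# Option 2: a list of lines, as readlines() would return it, joined back
joined = ''.join(raw.splitlines(True)[4:])

reps = {'"': '', ' ': ',', ':': ','}
reps.update((str(d) + '-', str(d) + ',') for d in range(10))
parsed = data_parser(body, reps)
```

Both options yield the same single string, so the original single-string `data_parser` can be applied to it unchanged.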

Pioneer_11 · Answer · 2022-07-07 09:46:28Z

1

Not directly related, but I would strongly encourage you to use with open(file) as x in place of separate open() and close() calls. Not only is this more Pythonic, it also eliminates the risk of forgetting or accidentally removing the close() call, and it closes the file automatically if an exception is raised inside the block. Overall it's easier to read and far more tolerant of errors.
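A minimal sketch of that behaviour: the function below raises on purpose while the file is open, and the file still ends up closed with its contents flushed. The names `write_then_fail` and `handles` are made up for the demonstration, and a temporary directory stands in for the real output location.

```python
import os
import tempfile

handles = []  # keeps a reference so the file object can be inspected afterwards


def write_then_fail(path):
    """Hypothetical helper: open a file, write to it, then raise on purpose."""
    with open(path, 'w') as handle:
        handles.append(handle)
        handle.write('partial output\n')
        raise ValueError('simulated crash while the file is open')


path = os.path.join(tempfile.mkdtemp(), 'test.csv')
try:
    write_then_fail(path)
except ValueError:
    pass

closed_after_error = handles[0].closed  # the with block closed the file
```

With explicit open() and close() calls, the same exception would skip the close() unless it were wrapped in try/finally.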

answered Jul 7, 2022 at 9:46

Pioneer_11

1 Comment

toki Over a year ago

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
