
I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):

product/productId: D7SDF9S9 
review/userId: asdf9uas0d8u9f 
review/score: 5.0 
review/some text here

product/productId: D39F99 
review/userId: fasd9fasd9f9f 
review/score: 4.1 
review/some text here

Each record is separated by two newline characters (\n\n). I have written a parser, shown below.

import re

with open("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

allsplits = re.split("\n\n", fullstr)

articles = []

for i, s in enumerate(allsplits):
    splits = re.split("\n.*?: ", s)
    productId = splits[0]
    userId = splits[1]
    profileName = splits[2]
    helpfulness = splits[3]
    rating = splits[4]
    time = splits[5]
    summary = splits[6]
    text = splits[7]

fw = open(outnamename, 'w')
fw.write(productId+"|"+userId+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")

return

The problem is that the file I am reading is so large that I run out of memory before the script can complete.
I suspect it's bombing out at the allsplits = re.split("\n\n", fullstr) line.
Can someone show me a way to read in one record at a time, parse it, write it to a file, and then move on to the next record?

  • This looks like something that sed was made for. Commented Feb 9, 2014 at 0:53
  • Do you always have a colon before the data? Your code makes me think so, but your last entry doesn't. And can that last entry (text) encompass multiple lines? Commented Feb 9, 2014 at 1:05

3 Answers


Don't read the whole file into memory in one go; produce records by making use of those blank lines between them. Write the data with the csv module for ease of producing pipe-delimited records.

The following code reads the input file a line at a time and writes out a CSV row per record as it goes. It never holds more than one line in memory, plus the one record currently being constructed.

import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename,'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')

    record = {}
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = {}
            continue

        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)

This code assumes that each line starts with text of the form category/key followed by a colon, so product/productId, review/userId, etc. The part after the slash is used for the CSV column names; the fields list at the top reflects these keys.
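As a quick sanity check of that key/value extraction, here is the same logic applied to one line from the question's sample:

```python
# Demonstrates the extraction used above on a single sample line;
# the printed key is what becomes the CSV column name.
line = "product/productId: D7SDF9S9\n"

field, value = line.split(': ', 1)        # split on the first ": " only
key = field.partition('/')[-1].strip()    # keep the part after the slash

print(key, value.strip())                 # productId D7SDF9S9
```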

Alternatively, you can remove that fields list and use a csv.writer instead, gathering the record values in a list instead:

import csv

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')

    record = []
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = []
            continue

        field, value = line.split(': ', 1)
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)

This version requires that all record fields are present and appear in the file in a fixed order.
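For a quick illustration of what that fixed-order output looks like, here is the same writer pointed at an in-memory buffer instead of a file (values taken from the question's sample; only a subset of fields for brevity):

```python
import csv
import io

# Write one pipe-delimited row to an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|')
writer.writerow(['D7SDF9S9', 'asdf9uas0d8u9f', '5.0', 'some text here'])

print(buf.getvalue().strip())  # D7SDF9S9|asdf9uas0d8u9f|5.0|some text here
```

Any field that itself contains a pipe would be quoted automatically by the csv module, which is one reason to prefer it over manual "|".join() concatenation.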


9 Comments

Hey, thanks! This looks good. I am getting this error when I use this method: "csv.writerow(record); AttributeError: 'module' object has no attribute 'writerow'". Do you know what my problem is?
@user2896837: that was a silly mistake on my part; corrected, it is writer.writerow().
now I'm getting: "File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/csv.py", line 153, in writerow return self.writer.writerow(self._dict_to_list(rowdict)) TypeError: 'str' does not support the buffer interface"
Ah, this is Python 3; adjusted the way the outputfile is opened for you.
sorry for so many questions. I'm annoying myself too. I'm getting this error: " File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode return codecs.ascii_decode(input, self.errors)[0] UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 4146: ordinal not in range(128)" after I changed the way the output file is open. I've also tried encoding the string using ".encode('utf-8')", but no luck yet. Thanks again for your help and patience.

Don't read the whole file into memory at once; instead, iterate over it line by line, and use Python's csv module to parse the records:

import csv

with open('hugeinputfile.txt', 'rb') as infile, open('outputfile.txt', 'wb') as outfile:

    writer = csv.writer(outfile, delimiter='|')

    for record in csv.reader(infile, delimiter='\n', lineterminator='\n\n'):
        values = [item.split(':')[-1].strip() for item in record[:-1]] + [record[-1]]
        writer.writerow(values)

A couple things to note here:

  • Use with to open files. Why? Because using with ensures that the file is close()d, even if an exception interrupts the script.

Thus:

with open('myfile.txt') as f:
    do_stuff_to_file(f)

is equivalent to:

f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()

To be continued... (I'm out of time ATM)

2 Comments

This won't split off the record keys; you are writing product/productId: D7SDF9S9 instead of D7SDF9S9.
@MartijnPieters: Ah, you're right! I overlooked that part.

Use "readline()" to read the fields of a record one by one. Or you can use read(n) to read "n" bytes.
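A minimal sketch of that readline() approach, assuming blank-line-separated records as in the question (the function name and the join-fields-as-is output format are my own simplifications, not from the answer):

```python
def parse_records(infile, outfile):
    """Stream blank-line-separated records from infile to pipe-delimited outfile."""
    record = []
    while True:
        line = infile.readline()
        if not line:                  # empty string means EOF
            break
        if line.strip():
            record.append(line.strip())
        elif record:                  # blank line marks the end of a record
            outfile.write("|".join(record) + "\n")
            record = []
    if record:                        # flush a final record with no trailing blank line
        outfile.write("|".join(record) + "\n")
```

Like the accepted answer, this keeps only one record in memory at a time; it just uses explicit readline() calls instead of iterating over the file object.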

