
I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):

product/productId: D7SDF9S9 
review/userId: asdf9uas0d8u9f 
review/score: 5.0 
review/some text here

product/productId: D39F99 
review/userId: fasd9fasd9f9f 
review/score: 4.1 
review/some text here

Each record is separated by two newline characters (\n\n). I have written a parser, shown below.

import re

with open("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

allsplits = re.split("\n\n", fullstr)

articles = []

for i, s in enumerate(allsplits):
    splits = re.split("\n.*?: ", s)
    productId = splits[0]
    userId = splits[1]
    profileName = splits[2]
    helpfulness = splits[3]
    rating = splits[4]
    time = splits[5]
    summary = splits[6]
    text = splits[7]

fw = open(outnamename, 'w')
fw.write(productId+"|"+userId+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")

return

The problem is that the file I am reading is so large that I run out of memory before the script can complete.
I suspect it's bombing out at the allsplits = re.split("\n\n", fullstr) line.
Can someone show me a way to read in one record at a time, parse it, write it to a file, and then move on to the next record?

  • This looks like something that sed was made for. Commented Feb 9, 2014 at 0:53
  • Do you always have a colon before the data? Your code makes me think so, but your last entry doesn't. And can that last entry (text) encompass multiple lines? Commented Feb 9, 2014 at 1:05

3 Answers


Don't read the whole file into memory in one go; produce records by making use of those blank lines between them. Write the data with the csv module for ease of producing pipe-delimited records.

The following code reads the input file a line at a time and writes out a CSV row per record as it goes. It never holds more than one line in memory, plus the one record currently being constructed.

import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename,'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')

    record = {}
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = {}
            continue

        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)

This code assumes that each line starts with text of the form category/key followed by a colon, so product/productId, review/userId, etc. The part after the slash is used for the CSV column names; the fields list at the top reflects these keys.
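As a quick sanity check of that key/value extraction, here is the same logic applied to one line from the question's sample:

```python
# Demonstrates the extraction used above on a single sample line;
# the printed key is what becomes the CSV column name.
line = "product/productId: D7SDF9S9\n"

field, value = line.split(': ', 1)        # split on the first ": " only
key = field.partition('/')[-1].strip()    # keep the part after the slash

print(key, value.strip())                 # productId D7SDF9S9
```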

Alternatively, you can remove that fields list and use a csv.writer instead, gathering the record values in a list instead:

import csv

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')

    record = []
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = []
            continue

        field, value = line.split(': ', 1)
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)

This version requires that all record fields are present and appear in the file in a fixed order.
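For a quick illustration of what that fixed-order output looks like, here is the same writer pointed at an in-memory buffer instead of a file (values taken from the question's sample; only a subset of fields for brevity):

```python
import csv
import io

# Write one pipe-delimited row to an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|')
writer.writerow(['D7SDF9S9', 'asdf9uas0d8u9f', '5.0', 'some text here'])

print(buf.getvalue().strip())  # D7SDF9S9|asdf9uas0d8u9f|5.0|some text here
```

Any field that itself contains a pipe would be quoted automatically by the csv module, which is one reason to prefer it over manual "|".join() concatenation.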


9 Comments

Hey, thanks! This looks good. I am getting this error when I use this method: "csv.writerow(record); AttributeError: 'module' object has no attribute 'writerow'". Do you know what my problem is?
@user2896837: that was a silly mistake on my part; corrected, it is writer.writerow().
now I'm getting: "File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/csv.py", line 153, in writerow return self.writer.writerow(self._dict_to_list(rowdict)) TypeError: 'str' does not support the buffer interface"
Ah, this is Python 3; adjusted the way the outputfile is opened for you.
sorry for so many questions. I'm annoying myself too. I'm getting this error: " File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode return codecs.ascii_decode(input, self.errors)[0] UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 4146: ordinal not in range(128)" after I changed the way the output file is open. I've also tried encoding the string using ".encode('utf-8')", but no luck yet. Thanks again for your help and patience.

Don't read the whole file into memory at once; instead, iterate over it line by line, and use Python's csv module to parse the records:

import csv

with open('hugeinputfile.txt', 'rb') as infile, open('outputfile.txt', 'wb') as outfile:

    writer = csv.writer(outfile, delimiter='|')

    for record in csv.reader(infile, delimiter='\n', lineterminator='\n\n'):
        values = [item.split(':')[-1].strip() for item in record[:-1]] + [record[-1]]
        writer.writerow(values)

A couple things to note here:

  • Use with to open files. Why? Because using with ensures that the file is close()d, even if an exception interrupts the script.

Thus:

with open('myfile.txt') as f:
    do_stuff_to_file(f)

is equivalent to:

f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()

To be continued... (I'm out of time ATM)

2 Comments

This won't split off the record keys; you are writing product/productId: D7SDF9S9 instead of D7SDF9S9.
@MartijnPieters: Ah, you're right! I overlooked that part.

Use "readline()" to read the fields of a record one by one. Or you can use read(n) to read "n" bytes.
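A minimal sketch of that readline() approach, assuming blank-line-separated records as in the question (the function name and the join-fields-as-is output format are my own simplifications, not from the answer):

```python
def parse_records(infile, outfile):
    """Stream blank-line-separated records from infile to pipe-delimited outfile."""
    record = []
    while True:
        line = infile.readline()
        if not line:                  # empty string means EOF
            break
        if line.strip():
            record.append(line.strip())
        elif record:                  # blank line marks the end of a record
            outfile.write("|".join(record) + "\n")
            record = []
    if record:                        # flush a final record with no trailing blank line
        outfile.write("|".join(record) + "\n")
```

Like the accepted answer, this keeps only one record in memory at a time; it just uses explicit readline() calls instead of iterating over the file object.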

