
I need to create a Pandas DataFrame from a large file of space-delimited values, where the row structure depends on the number of columns.

Raw data looks like this:

2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0

The values may appear on one line or several; line breaks are ignored.

End result looks like this, if number of columns is three:

[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]

Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.

Currently I read the whole file into one big list and then split it into rows:

from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

(code is from itertools examples)
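
For reference, this is roughly how I use it at the moment (a minimal sketch; n_cols = 3 stands in for the column count read from the meta part of the file):

n_cols = 3  # stand-in for the column count from the meta part of the file
with open('raw_data.csv') as f:
    values = f.read().split()           # one big list of every value in the file
rows = list(grouper(n_cols, values))    # grouper as defined above
# rows == [('2008231.0', '4891866.0', '383842.0'),
#          ('2036693.0', '4924388.0', '375170.0')]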

The problem is that I end up with multiple copies of the data in memory. With 500 MB+ files this eats up memory fast, and Pandas has some trouble reading lists this big with large MultiIndexes.

How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?

Or is there another way of reading data into Pandas without auxiliary data structures?

2 Answers

Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:

import pandas as pd
import re

filename = 'raw_data.csv'
class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []
    def next(self):
        while not self.buffer:
            # raises StopIteration at end of file
            line = super(FileLike, self).next()
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        # pop from the front so rows keep their original order
        return self.buffer.pop(0)

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')
    print(len(df))

When I try using FileLike on a 5.8 MB file (consisting of 200,000 lines), the above code takes 3.9 seconds to run.

If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):

import fileinput
import sys
import re

filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace=True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)
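
For the sample data above, the rewritten raw_data.csv should then contain one three-column row per line:

2008231.0 4891866.0 383842.0
2036693.0 4924388.0 375170.0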

then you can of course load the data normally into Pandas using pd.read_table:

with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')
    print(len(df))

The time required to rewrite the file was ~0.6 seconds, and loading the DataFrame then took ~0.7 seconds.

So, it appears you will be better off rewriting your data to disk first.


1 Comment

Thank you for the idea. My approach and @andy-hayden's answer both fail on my big test file, so I fear that read_table won't fare any better even if the data is in an easier format. I have to test it properly once I manage to get the conversion working. Our file format allows line breaks in odd places, so looping through the file by line breaks is not guaranteed to work.

I don't think there is a way to separate rows with the same delimiter as columns.

One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:

s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
df = pd.DataFrame(s.values.reshape(len(s) // n, n))

In your example:

In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)

In [2]: s
Out[2]: 
0    2008231
1    4891866
2     383842
3    2036693
4    4924388
5     375170
Name: 0, dtype: float64

In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]: 
         0        1       2
0  2008231  4891866  383842
1  2036693  4924388  375170
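
If the column count has to come from the meta part of the file rather than being hard-coded, the same steps can be wrapped up along these lines (a sketch, not tested against your actual file; how you parse the meta section to get n is up to you):

import pandas as pd

def read_as_dataframe(file_name, n):
    # n is the column count taken from the meta part of the file
    s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
    return pd.DataFrame(s.values.reshape(len(s) // n, n))

df = read_as_dataframe('raw_data.csv', 3)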

3 Comments

Well, your solution works, thank you. Unfortunately Pandas explodes my memory in the series creation phase using my 500MB file from hell. With smaller datasets this approach might work better than my solution, but I really have to do some profiling first.
@jarpineh which version of pandas are you using?
I have 0.10.0, which is the current one from PyPI.
