
I need to create a Pandas DataFrame from a large file of space-delimited values, where the row structure depends on the number of columns.

Raw data looks like this:

2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0

The values may appear on one line or several; line breaks are ignored.

End result looks like this, if number of columns is three:

[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]

Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.

Currently I read the whole file into one big list and then split it into rows:

from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

(code is from itertools examples)
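
For reference, this is roughly how I use it at the moment (a minimal sketch; n_cols = 3 stands in for the column count read from the meta part of the file):

n_cols = 3  # stand-in for the column count from the meta part of the file
with open('raw_data.csv') as f:
    values = f.read().split()           # one big list of every value in the file
rows = list(grouper(n_cols, values))    # grouper as defined above
# rows == [('2008231.0', '4891866.0', '383842.0'),
#          ('2036693.0', '4924388.0', '375170.0')]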

The problem is that I end up with multiple copies of the data in memory. With 500 MB+ files this eats up memory fast, and Pandas has some trouble reading lists this big with large MultiIndexes.

How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?

Or is there another way of reading data into Pandas without auxiliary data structures?

2 Answers

Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:

import pandas as pd
import re

filename = 'raw_data.csv'
class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []
    def next(self):
        while not self.buffer:
            # raises StopIteration at end of file
            line = super(FileLike, self).next()
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        # pop from the front so rows keep their original order
        return self.buffer.pop(0)

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')
    print(len(df))

When I try using FileLike on a 5.8 MB file (consisting of 200,000 lines), the above code takes 3.9 seconds to run.

If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):

import fileinput
import sys
import re

filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace=True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)
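
For the sample data above, the rewritten raw_data.csv should then contain one three-column row per line:

2008231.0 4891866.0 383842.0
2036693.0 4924388.0 375170.0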

then you can of course load the data normally into Pandas using pd.read_table:

with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')
    print(len(df))

The time required to rewrite the file was ~0.6 seconds, and loading the DataFrame then took ~0.7 seconds.

So, it appears you will be better off rewriting your data to disk first.


1 Comment

Thank you for the idea. My approach and @andy-hayden's answer both fail on my big test file, so I fear that read_table won't fare any better even if the data is in an easier format. I have to test it properly once I manage to get the conversion working. Our file format allows line breaks in odd places, so looping through the file by line breaks is not guaranteed to work.

I don't think there is a way to separate rows with the same delimiter as columns.

One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:

s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
df = pd.DataFrame(s.values.reshape(len(s) // n, n))

In your example:

In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)

In [2]: s
Out[2]: 
0    2008231
1    4891866
2     383842
3    2036693
4    4924388
5     375170
Name: 0, dtype: float64

In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]: 
         0        1       2
0  2008231  4891866  383842
1  2036693  4924388  375170
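
If the column count has to come from the meta part of the file rather than being hard-coded, the same steps can be wrapped up along these lines (a sketch, not tested against your actual file; how you parse the meta section to get n is up to you):

import pandas as pd

def read_as_dataframe(file_name, n):
    # n is the column count taken from the meta part of the file
    s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
    return pd.DataFrame(s.values.reshape(len(s) // n, n))

df = read_as_dataframe('raw_data.csv', 3)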

3 Comments

Well, your solution works, thank you. Unfortunately Pandas explodes my memory in the series creation phase using my 500MB file from hell. With smaller datasets this approach might work better than my solution, but I really have to do some profiling first.
@jarpineh which version of pandas are you using?
I have 0.10.0, which is the current one from PyPI.
