
Say I have a text file like this:

A 12 16 91
A 22 56 31
A 17 25 22
B 34,543,683,123 34
A 19 27 32
B 45,48,113,523 64
A 11 24 72
C asd,asd,qwe ewr 123

Using Pandas read_csv I can:

import pandas as pd

from_csv = pd.read_csv('test.txt', sep=' ', header=None, names=['a','s','d','f'])
from_csv.head()

This works fine if the rows starting with B or C aren't there.

How can I tell read_csv to read only the lines starting with A?

  • No, you have to filter as a post-processing step, e.g. df[df['a'] == 'A'] Commented Jan 12, 2016 at 10:04
  • Do you want to skip the lines starting with B, or get all the lines starting with A? Commented Jan 12, 2016 at 10:12
  • @PadraicCunningham Ideally skip all lines except for the A lines Commented Jan 12, 2016 at 10:12
  • You should also consider whether filtering while reading or as a post-processing step is better in terms of usefulness and performance. Depending on the size of the file, checking every line of the csv before loading may be slower than simply loading everything and then filtering (a minimal sketch of that post-processing filter follows these comments). Commented Jan 12, 2016 at 10:41
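For reference, a minimal sketch of the load-then-filter approach from the first comment, assuming the file and column names used in the question (this is an illustration, not part of either answer below):

import pandas as pd

# load everything, then keep only the rows whose first column is 'A'
df = pd.read_csv('test.txt', sep=' ', header=None, names=['a', 's', 'd', 'f'])
df = df[df['a'] == 'A']
# note: the numeric columns may still have object dtype at this point because
# of the B/C rows, so converting 's', 'd', 'f' with .astype(int) may be needed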

2 Answers


I agree with the other answer's suggestion of filtering yourself, but I think it's faster to read the file in chunks, filter out the lines you want to keep, and then hand everything to a single Pandas read (rather than building the DataFrame one record per row):

from io import StringIO  # on Python 2, use: from StringIO import StringIO


def read_buffered(fle, keep):
    READ_SIZE = 10000
    buff = StringIO()
    with open(fle) as f:
        while True:
            # readlines() with a size hint reads roughly READ_SIZE bytes
            # worth of whole lines per iteration
            chunk = f.readlines(READ_SIZE)
            if not chunk:
                break
            # keep only the lines whose first character matches `keep`
            buff.writelines(line for line in chunk if line[0] == keep)
    buff.seek(0)
    return buff

Then you can pass the returned object to pandas as if it were a file:

from_csv = pd.read_csv(read_buffered('test.txt','A'), 
     sep=' ', header=None, names=['a','s','d','f'])
from_csv.head()

In my tests, this is about twice as fast as the accepted solution (but this likely depends on the fraction of rows that you filter out and on whether you can fit two copies of your data in memory):

In [128]: timeit pd.read_csv(read_buffered("test.txt","A"), sep=' ', header=None, names=['a','s','d','f'])
10 loops, best of 3: 22 ms per loop

In [129]: timeit read_only_csv("test.txt", "A", 0, sep=" ", columns=['a', 's', 'd', 'f'])                     
10 loops, best of 3: 45.7 ms per loop



You could do the filtering yourself:

import csv

import pandas as pd


def read_only_csv(fle, keep, col, sep=",", **kwargs):
    with open(fle) as f:
        # build the DataFrame from only the rows whose `col` field equals `keep`
        return pd.DataFrame.from_records(
            (r for r in csv.reader(f, delimiter=sep) if r[col] == keep),
            **kwargs)


df = read_only_csv("test.txt", "A", 0, sep=" ", columns=['a', 's', 'd', 'f'])

Which would give you:

   a   s   d   f
0  A  12  16  91
1  A  22  56  31
2  A  17  25  22
3  A  19  27  32
4  A  11  24  72

For a file with ~80k lines, using read_csv and then filtering is still faster; the only advantage of filtering while reading is that it won't use as much memory.

In [24]: %%timeit     
df = pd.read_csv('out.txt', sep=' ', header=None, names=['a','s','d','f'])
df = df[df["a"] == "A"]
   ....: 
10 loops, best of 3: 31.8 ms per loop

In [25]: timeit read_only_csv("out.txt", "A", 0, sep=" ", columns=['a', 's', 'd', 'f'])
10 loops, best of 3: 41.1 ms per loop
