I agree with the other answer that you should do the filtering yourself, but I think it is faster to read the file in chunks, filter the lines you want to keep, and then use a single pandas reader (rather than creating one reader per row):
from io import StringIO  # on Python 2, use: from StringIO import StringIO

def read_buffered(fle, keep):
    READ_SIZE = 10000
    buff = StringIO()
    with open(fle) as f:
        while True:
            # readlines() with a size hint reads roughly READ_SIZE bytes
            # worth of whole lines per iteration
            readBuffer = f.readlines(READ_SIZE)
            if not readBuffer:
                break
            # keep only the lines whose first character matches the key
            buff.writelines([x for x in readBuffer if x[0] == keep])
    buff.seek(0)
    return buff
Then you can pass the returned object to pandas as if it were a file:
import pandas as pd

from_csv = pd.read_csv(read_buffered('test.txt', 'A'),
                       sep=' ', header=None, names=['a', 's', 'd', 'f'])
from_csv.head()
In my tests, this is about twice as fast as the accepted solution, although the gain likely depends on the fraction of rows you filter out and on whether two copies of your data fit in memory:
In [128]: timeit pd.read_csv(read_buffered("test.txt","A"), sep=' ', header=None, names=['a','s','d','f'])
10 loops, best of 3: 22 ms per loop
In [129]: timeit read_only_csv("test.txt", "A", 0, sep=" ", columns=['a', 's', 'd', 'f'])
10 loops, best of 3: 45.7 ms per loop
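If holding a second, filtered copy of the data in memory is a concern, one possible variation (a hypothetical sketch I have not benchmarked) is to spill the filtered lines to a temporary file instead of a StringIO; pandas accepts any file-like object:

import tempfile

def read_filtered_tempfile(fle, keep, read_size=10000):
    # Hypothetical alternative: buffer the kept lines on disk rather than in RAM
    tmp = tempfile.TemporaryFile(mode='w+')
    with open(fle) as f:
        while True:
            lines = f.readlines(read_size)
            if not lines:
                break
            tmp.writelines(line for line in lines if line[0] == keep)
    tmp.seek(0)
    return tmp  # pass directly to pd.read_csv like the StringIO above

This trades some speed (extra disk I/O) for a lower memory footprint, but the read_csv call stays exactly the same.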