
I have large CSVs where I'm only interested in a subset of the rows. In particular, I'd like to read in all the rows which occur before a particular condition is met.

For example, if read_csv would yield the dataframe:

     A    B      C
1   34   3.20   'b'
2   24   9.21   'b'
3   34   3.32   'c'
4   24   24.3   'c'
5   35   1.12   'a'
... 
1e9 42   2.15   'd'

is there some way to read all the rows in the csv until col B exceeds 10? In the above example, I'd like to read in:

     A    B      C
1   34   3.20   'b'
2   24   9.21   'b'
3   34   3.32   'c'
4   24   24.3   'c'

I know how to throw these rows out once I've read the dataframe in, but at that point I've already spent all that computation reading them in. I do not have access to the index of the final row before reading the csv (no skipfooter, please).

2 Comments

  • I don't think there's a straightforward way to do this in the Pandas API. You'll probably just have to break out csv, grab the rows one at a time, stuff them in a list of lists, stop once you get the last row that you want, and then build a DataFrame out of the resulting list of lists (see the sketch after these comments). Commented Jan 30, 2015 at 15:52
  • You could read the csv in chunks and only append if the subset meets your condition. Commented Jan 30, 2015 at 15:58
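
A minimal sketch of the first comment's approach, assuming a hypothetical file "filename.csv" with a header row and column B in the second position:

import csv
import pandas as pd

rows = []
with open("filename.csv", newline="") as fin:
    reader = csv.reader(fin)
    header = next(reader)
    for row in reader:
        rows.append(row)           # keep the row, including the first one past the cutoff
        if float(row[1]) > 10:     # column B is the second field
            break                  # stop reading as soon as B exceeds 10

df = pd.DataFrame(rows, columns=header)
df['B'] = df['B'].astype(float)    # csv yields strings; restore a numeric dtype

This stops consuming the file at the first violating row, at the cost of pure-Python row handling, which is slower per row than pandas' C parser.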

5 Answers


You could read the csv in chunks. Since pd.read_csv will return an iterator when the chunksize parameter is specified, you can use itertools.takewhile to read only as many chunks as you need, without reading the whole file.

import itertools as IT
import pandas as pd

chunksize = 10 ** 5
# with chunksize set, read_csv returns an iterator of DataFrames
chunks = pd.read_csv(filename, chunksize=chunksize)
# keep consuming chunks only while the last B value in each chunk is below 10
chunks = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
df = pd.concat(chunks)
# kept chunks may still contain stray rows with B >= 10; filter them out
mask = df['B'] < 10
df = df.loc[mask]

Or, to avoid having to use df.loc[mask] to remove unwanted rows, and to keep the valid rows of the boundary chunk (which takewhile discards wholesale), perhaps a cleaner solution would be to define a custom generator:

import pandas as pd

def valid(chunks):
    for chunk in chunks:
        mask = chunk['B'] < 10
        if mask.all():
            # every row in this chunk is below the cutoff; keep it whole
            yield chunk
        else:
            # keep only the valid rows of the boundary chunk, then stop
            yield chunk.loc[mask]
            break

chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize)
df = pd.concat(valid(chunks))

2 Comments

@DSM: Do you mean chunk.ix[-1, 'B']?
@DSM: Thanks, you're right. Even chunk.ix[-1, 'B'] would return the wrong value if chunk's index included -1 as a value.

Warning: pd.read_csv("filename.csv") loads the entire csv into an in-memory DataFrame before processing it (thanks @BlackJack for pointing this out). If the csv is big, @unutbu's answer is more appropriate (or perhaps another library like polars, which can read a file in chunks and apply multiple operations thanks to its query planner; see the sketch below).
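
A minimal sketch of the polars route, assuming a hypothetical "filename.csv" with the same columns as the question:

import polars as pl

# scan_csv builds a lazy query; nothing is read at this point
lazy = pl.scan_csv("filename.csv")
# the query planner pushes the filter into the scan, so the file is
# streamed and filtered without materializing it all in memory
df = lazy.filter(pl.col("B") < 10).collect()

Like the pandas one-liner below, this keeps every row with B < 10 rather than stopping at the first row that exceeds 10.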

Building on @joanwa's answer:

df = (pd.read_csv("filename.csv")
      [lambda x: x['B'] > 10])

From Wes McKinney's "Python for Data Analysis" chapter on "Advanced pandas":

We cannot refer to the result of load_data until it has been assigned to the temporary variable df. To help with this, assign and many other pandas functions accept function-like arguments, also known as callables.

To show callables in action, consider ...

df = load_data()
df2 = df[df['col2'] < 0]

Can be rewritten as:

df = (load_data()
      [lambda x: x['col2'] < 0])

4 Comments

Wow, this is magic!
Hi, I was also working on a similar csv, but I want to read my csv file only until a row with certain text, say 'unwanted', is found. Can we do it with a lambda function?
It's formatted quite confusingly and isn't magic, and it's not a solution to the question, which asks for reading to stop, not for all data to be read into memory and then filtered. It also gives different results because values in column "B" drop below 10 again later in the data.
Thanks @Rachel & @BlackJack. @Rachel, this solution is not appropriate for that use case since, as @BlackJack points out, it loads the entire csv into memory; a chunked variant for your case is sketched after these comments.
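
For @Rachel's case, a minimal sketch under stated assumptions (a hypothetical file "rachel.csv" whose marker text 'unwanted' appears in a column named "C"), reusing the chunked-generator pattern from the accepted answer with the numeric test swapped for a string test:

import pandas as pd

def before_marker(chunks, marker='unwanted'):
    for chunk in chunks:
        mask = chunk['C'] != marker            # True until the marker row appears
        if mask.all():
            yield chunk
        else:
            stop = mask.to_numpy().argmin()    # position of the first marker row
            yield chunk.iloc[:stop]            # rows before the marker
            break

chunks = pd.read_csv("rachel.csv", chunksize=10 ** 5)
df = pd.concat(before_marker(chunks))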

You can use the built-in csv module to calculate the appropriate row number. Then use pd.read_csv with the nrows argument:

from io import StringIO
import csv
import pandas as pd

mycsv = StringIO(""" A      B     C
34   3.20   'b'
24   9.21   'b'
34   3.32   'c'
24   24.3   'c'
35   1.12   'a'""")

reader = csv.reader(mycsv, delimiter=' ', skipinitialspace=True)
header = next(reader)
# index of the first row where B exceeds 10
counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)

mycsv.seek(0)  # rewind the buffer so pandas can read it again from the start
# nrows=counter+1 keeps rows up to and including the first one where B > 10
df = pd.read_csv(mycsv, delim_whitespace=True, nrows=counter+1)

print(df)

    A      B    C
0  34   3.20  'b'
1  24   9.21  'b'
2  34   3.32  'c'
3  24  24.30  'c'
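
The same two-pass idea for an on-disk file (the name "data.csv" is hypothetical): scan once with csv to find the cutoff row, then let pandas' fast parser read only what is needed:

import csv
import pandas as pd

with open("data.csv", newline="") as fin:
    reader = csv.reader(fin, delimiter=' ', skipinitialspace=True)
    next(reader)    # skip the header row
    counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)

# the parser stops after the first row where B exceeds 10
df = pd.read_csv("data.csv", delim_whitespace=True, nrows=counter + 1)

The first pass still touches every byte up to the cutoff, but it avoids building DataFrame rows for data that would be thrown away.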

Comments


Instead of boolean indexing or passing a callable, one can use the query method.

import pandas as pd

df = pd.read_csv("my_data.csv").query("B < 10")

I don't know how fast this solution is; however, it should be faster than passing a plain callable, especially a lambda.

1 Comment

This reads the whole data into memory and then filters, so it doesn't answer the question. It also doesn't stop the first time the value exceeds 10.

I would go the easy route described here:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

df[df['B'] > 10]

3 Comments

This is not what the OP is asking for; this can only be applied after reading the entire csv.
He would like to save time and memory by skipping the useless rows.
Also this (non-)solution doesn't yield the desired result, which is the rows up to the first time "B" values exceed 10.
