
I can't read the data from a CSV file into memory because it is too large, i.e. loading it with pandas.read_csv won't work.

I only want to pull out the rows matching certain column values, and those rows should fit into memory. With a pandas DataFrame df that hypothetically contained the full data from the CSV, I would do

df.loc[df['column_name'] == 1]

The CSV file does have a header row, and the columns are in a known order, so I don't strictly need to refer to the column by name; I can use its position instead if I have to.

How can I achieve this? I have read a bit about PySpark, but I don't know whether it would be useful here.

1 Answer


You can read the CSV file chunk by chunk and retain only the rows you want:

import pandas as pd

# Read the CSV in chunks of 10,000 rows; filter each chunk as it is read,
# then concatenate the matching rows. on_bad_lines='skip' replaces the
# deprecated error_bad_lines=False (pandas >= 1.3).
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000, on_bad_lines='skip')
data = pd.concat([chunk.loc[chunk['column_name'] == 1] for chunk in iter_csv])
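If you only know the column's position rather than its name, the same chunked filter can key off the header by index. A minimal sketch, assuming a hypothetical file 'sample.csv' and a hypothetical column index of 3:

import pandas as pd

COL_IDX = 3  # hypothetical position of the column to filter on

filtered = []
for chunk in pd.read_csv('sample.csv', chunksize=10000):
    name = chunk.columns[COL_IDX]  # resolve the position to its header name
    filtered.append(chunk.loc[chunk[name] == 1])
data = pd.concat(filtered, ignore_index=True)

Filtering inside the loop keeps peak memory proportional to one chunk plus the matching rows, rather than the whole file.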
