
I can't read the data from a CSV file into memory because it is too large, i.e. loading it with pandas.read_csv won't work.

I only want to pull out the rows matching certain column values, and those rows should fit into memory. With a pandas DataFrame df that hypothetically contained the full data from the CSV, I would do

df.loc[df['column_name'] == 1]

The CSV file does have a header row, and the columns are in a known order, so I don't strictly need to refer to the column by name; I can use its position instead if I have to.

How can I achieve this? I have read a bit about PySpark, but I don't know whether it would be useful here.

1 Answer


You can read the CSV file chunk by chunk and retain only the rows you want:

import pandas as pd

# Read the CSV in chunks of 10,000 rows; filter each chunk as it is read,
# then concatenate the matching rows. on_bad_lines='skip' replaces the
# deprecated error_bad_lines=False (pandas >= 1.3).
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000, on_bad_lines='skip')
data = pd.concat([chunk.loc[chunk['column_name'] == 1] for chunk in iter_csv])
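If you only know the column's position rather than its name, the same chunked filter can key off the header by index. A minimal sketch, assuming a hypothetical file 'sample.csv' and a hypothetical column index of 3:

import pandas as pd

COL_IDX = 3  # hypothetical position of the column to filter on

filtered = []
for chunk in pd.read_csv('sample.csv', chunksize=10000):
    name = chunk.columns[COL_IDX]  # resolve the position to its header name
    filtered.append(chunk.loc[chunk[name] == 1])
data = pd.concat(filtered, ignore_index=True)

Filtering inside the loop keeps peak memory proportional to one chunk plus the matching rows, rather than the whole file.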
