
I am trying to read a CSV file into a DataFrame, but I'm having issues because the CSV is too large (the process is being killed).

I am only trying to do some simple updates to the DataFrame.

This is my current code:

import pandas as pd

df = pd.read_csv(input_file)
df = df[df.col_5 != 'col_5']  # drop rows where col_5 holds the literal string 'col_5'
columns_req = ['COL_1', 'COL_2', 'COL_3', 'COL_4']
df = df.loc[:, columns_req]   # keep only the required columns
df = df.rename(columns={col: col.lower() for col in df.columns})
df.to_csv(output_file, sep=',', index=False)

All of the code above works as expected with a smaller CSV, but breaks with a larger one.

Is there any way I can process this?

I have read that I can iterate such as:

foo = pd.read_csv(input_file, iterator=True, chunksize=1000)

But I don't know if this will work as I expect. How do I apply my alterations to foo and then combine all the rows again at the end?

1 Answer


You could read in chunks, as you say. Here is an example, starting by generating a large test file:

import pandas as pd
import numpy as np
import time

# build a large test file: 10 million rows, 11 columns
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(10000000, 10)),
                  columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
df['K'] = pd.util.testing.rands_array(5, 10000000)  # random 5-character strings
df.to_csv("my_file.csv")

If you read your file the usual way:

start = time.time()
df = pd.read_csv('my_file.csv')
end = time.time()
print("Reading time: ",(end-start),"sec")

Reading time:  20.328343152999878 sec

while creating the chunked reader returns almost instantly. Note that pd.read_csv with chunksize only sets up an iterator; the chunks are read lazily, so the actual work happens when they are consumed (here, inside pd.concat), not in the timed section:

start = time.time()
chunks = pd.read_csv('my_file.csv', chunksize=1000000)
end = time.time()
print("Reading time: ",(end-start),"sec")
pd_df = pd.concat(chunks)  # the file is actually read here, one chunk at a time

Reading time:   0.011000394821166992 sec
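
To the original question of applying the alterations to each chunk and combining everything at the end: with a file this size you usually don't want to rebuild one big DataFrame at all. Here is a minimal sketch, reusing input_file, output_file, and the transformations from your question, that processes each chunk and appends it straight to the output file:

import pandas as pd

columns_req = ['COL_1', 'COL_2', 'COL_3', 'COL_4']

first = True
for chunk in pd.read_csv(input_file, chunksize=1000000):
    chunk = chunk[chunk.col_5 != 'col_5']    # same row filter as in the question
    chunk = chunk.loc[:, columns_req]        # keep only the required columns
    chunk = chunk.rename(columns=str.lower)  # lower-case the headers
    # write the header only for the first chunk, then append without it
    chunk.to_csv(output_file, mode='w' if first else 'a', header=first, index=False)
    first = False

Peak memory stays at roughly one chunk. If you genuinely need the combined DataFrame in memory afterwards, collect the processed chunks in a list and pd.concat them at the end instead of writing to disk.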