What is the best/easiest way to split a very large data frame (50GB) into multiple outputs (horizontally)?

I thought about doing something like:

stepsize = int(1e8)
for chunk_id, start in enumerate(range(0, len(df), stepsize)):
    # iloc slicing is end-exclusive, so the last (possibly shorter)
    # chunk is handled automatically
    df.iloc[start:start + stepsize].to_csv('/data/bs_' + str(chunk_id) + '.csv.out')

But I bet there is a smarter solution out there?

As noted by jakevdp, HDF5 is a better way to store huge amounts of numerical data; however, it doesn't meet my business requirements.

2 Answers

This answer led me to a satisfying solution using np.array_split:

import numpy as np

for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
    chunk.to_csv(f'/data/bs_{idx}.csv')
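
Since number_of_chunks is left undefined above, here is a minimal, self-contained sketch of one way to derive it from a target number of rows per output file. The tiny stand-in frame, rows_per_file, and the math.ceil sizing are illustrative assumptions; the /data/bs_ path pattern is the one from the answer.

import math
import numpy as np
import pandas as pd

# Small stand-in for the 50GB frame in the question, purely for illustration.
df = pd.DataFrame({'a': range(10), 'b': range(10)})

# Hypothetical sizing knob: aim for roughly this many rows per output file.
rows_per_file = 4
number_of_chunks = max(1, math.ceil(len(df) / rows_per_file))

for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
    # np.array_split tolerates lengths that don't divide evenly;
    # the last chunks are simply shorter.
    chunk.to_csv(f'/data/bs_{idx}.csv')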

1 Comment

You have attached the wrong link; it should be numpy.org/doc/stable/reference/generated/numpy.array_split.html

Use id in the filename, otherwise it will not work. You missed id, and without it the code gives an error.

import numpy as np

for id, df_i in enumerate(np.array_split(df, number_of_chunks)):
    df_i.to_csv('/data/bs_{id}.csv'.format(id=id))
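
As a quick sanity check (not part of the answer itself), the chunk files can be read back and reassembled to confirm no rows were lost. df and number_of_chunks are the same objects used above; index_col=0 accounts for to_csv writing the index by default.

import pandas as pd

# Hypothetical round-trip check: read every chunk back and confirm the
# combined row count matches the original frame.
parts = [pd.read_csv('/data/bs_{id}.csv'.format(id=i), index_col=0)
         for i in range(number_of_chunks)]
reassembled = pd.concat(parts)
assert len(reassembled) == len(df)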
