I have a df

df = pd.DataFrame(np.random.randn(11,3))

           0         1         2
0   0.102645 -1.530977  0.408735
1   1.081442  0.615082 -1.457931
2   1.852951  0.360998  0.178162
3   0.726028  2.072609 -1.167996
4  -0.454453  1.310887 -0.969910
5  -0.098552 -0.718283  0.372660
6   0.334170 -0.347934 -0.626079
7  -1.034541 -0.496949 -0.287830
8   1.870277  0.508380 -2.466063
9   1.464942 -0.020060 -0.684136
10 -1.057930  0.295145  0.161727

How can I split this into a given number of subsections, let's say 2 for now?

Something like this

           0         1         2
0   0.102645 -1.530977  0.408735
1   1.081442  0.615082 -1.457931
2   1.852951  0.360998  0.178162
3   0.726028  2.072609 -1.167996
4  -0.454453  1.310887 -0.969910

           0         1         2
5  -0.098552 -0.718283  0.372660
6   0.334170 -0.347934 -0.626079
7  -1.034541 -0.496949 -0.287830
8   1.870277  0.508380 -2.466063
9   1.464942 -0.020060 -0.684136
10 -1.057930  0.295145  0.161727

Ideally I would like to use np.array_split(df, 2), but it throws an error because df is not an array.

Is there a built-in function to do this? I don't particularly want to use df.loc[a:b] because it's difficult to calculate the start and end for a given number of sub-dataframes.

1 Answer

Try the following. It should return a list of at most n sub-dataframes that, when concatenated, give back the original dataframe.

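import math

def split(df, n):
    # Rows per chunk, rounded up. This relies on true division:
    # on Python 2, use float(len(df)) / n instead.
    size = math.ceil(len(df) / n)
    # Slice rows by position; the last chunk may be shorter.
    return [df[i:i + size] for i in range(0, len(df), size)]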
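
As a quick check, here is a minimal sketch that runs the function above on an 11-row frame like the one in the question, assuming Python 3 (where / is true division). The names parts, sizes and rebuilt are only illustrative.

```python
import math

import numpy as np
import pandas as pd


def split(df, n):
    # Rows per chunk, rounded up (true division on Python 3).
    size = math.ceil(len(df) / n)
    return [df[i:i + size] for i in range(0, len(df), size)]


df = pd.DataFrame(np.random.randn(11, 3))

parts = split(df, 2)
sizes = [len(p) for p in parts]

# Concatenating the pieces in order should give back the original frame.
rebuilt = pd.concat(parts)

print(sizes)
print(rebuilt.equals(df))
```

Note that with 11 rows and n = 2 the chunk size is 6, so the first piece gets 6 rows and the second gets 5, which is the reverse of the 5/6 layout shown in the question.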

4 Comments

Thanks for this, but the only issue is the remainder: split(df, 2) for my df returns 3 sub-dataframes. Is there no way to use np.array_split() somehow, as that handles remainders automatically?
If you're using Python 2.x, try changing the line that calculates size to size = math.ceil(float(len(df)) / n)
I have no idea what you have done, but it's working well. I'll run some more tests and let you know how it goes, but thanks!
In Python 2.x, / defaults to integer division if both operands are integers. In Python 3, it performs floating-point division, which is required for the bucket size to be calculated properly. That's why explicitly converting the dataframe length to a floating-point number fixed your problem.
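
To make the arithmetic concrete, here is a small sketch for 11 rows and n = 2. It assumes Python 3 and uses // to stand in for what Python 2's / does with two integers. The variable names are illustrative only.

```python
import math

rows, n = 11, 2

# Floor division: 11 // 2 == 5, so ceil has nothing left to round up.
size_floor = math.ceil(rows // n)
# True division: 11 / 2 == 5.5, which ceil rounds up to 6.
size_true = math.ceil(rows / n)

# Number of chunks produced by stepping through the rows with each size.
chunks_floor = len(range(0, rows, size_floor))
chunks_true = len(range(0, rows, size_true))

print(size_floor, chunks_floor)  # 5 3
print(size_true, chunks_true)    # 6 2
```

A chunk size of 5 leaves one row over, which becomes the unwanted third sub-dataframe. A chunk size of 6 covers all 11 rows in two pieces.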
