98

I have a pandas DataFrame that I built with concat. Each row has 96 values, and I would like to split the DataFrame at column 72.

The first 72 values of each row should be stored in Dataframe1, and the remaining 24 values in Dataframe2.

I create my DF as follows:

from pandas import DataFrame, concat

temps = DataFrame(myData)
datasX = concat(
[temps.shift(72), temps.shift(71), temps.shift(70), temps.shift(69), temps.shift(68), temps.shift(67),
 temps.shift(66), temps.shift(65), temps.shift(64), temps.shift(63), temps.shift(62), temps.shift(61),
 temps.shift(60), temps.shift(59), temps.shift(58), temps.shift(57), temps.shift(56), temps.shift(55),
 temps.shift(54), temps.shift(53), temps.shift(52), temps.shift(51), temps.shift(50), temps.shift(49),
 temps.shift(48), temps.shift(47), temps.shift(46), temps.shift(45), temps.shift(44), temps.shift(43),
 temps.shift(42), temps.shift(41), temps.shift(40), temps.shift(39), temps.shift(38), temps.shift(37),
 temps.shift(36), temps.shift(35), temps.shift(34), temps.shift(33), temps.shift(32), temps.shift(31),
 temps.shift(30), temps.shift(29), temps.shift(28), temps.shift(27), temps.shift(26), temps.shift(25),
 temps.shift(24), temps.shift(23), temps.shift(22), temps.shift(21), temps.shift(20), temps.shift(19),
 temps.shift(18), temps.shift(17), temps.shift(16), temps.shift(15), temps.shift(14), temps.shift(13),
 temps.shift(12), temps.shift(11), temps.shift(10), temps.shift(9), temps.shift(8), temps.shift(7),
 temps.shift(6), temps.shift(5), temps.shift(4), temps.shift(3), temps.shift(2), temps.shift(1), temps,
 temps.shift(-1), temps.shift(-2), temps.shift(-3), temps.shift(-4), temps.shift(-5), temps.shift(-6),
 temps.shift(-7), temps.shift(-8), temps.shift(-9), temps.shift(-10), temps.shift(-11), temps.shift(-12),
 temps.shift(-13), temps.shift(-14), temps.shift(-15), temps.shift(-16), temps.shift(-17), temps.shift(-18),
 temps.shift(-19), temps.shift(-20), temps.shift(-21), temps.shift(-22), temps.shift(-23)], axis=1)

The question is: how can I split them? :)

2
  • And N DataFrames automatically? Commented Oct 22, 2018 at 23:36
  • 6
    Please edit the question to specify that you want to split vertically along columns and not horizontally along rows. Commented Jun 5, 2020 at 12:01

6 Answers

134

Use iloc to slice by column position:

df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]

(iloc docs)
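
As a quick check of the slicing above, here is a minimal sketch. The sample data in `temps` is made up, and the list comprehension is only a shorter way to write the 96 `shift` calls from the question; neither is part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for myData from the question.
temps = pd.DataFrame(np.arange(200.0))

# Same 96 columns as the question:
# shift(72) ... shift(1), temps itself, shift(-1) ... shift(-23).
datasX = pd.concat([temps.shift(i) for i in range(72, -24, -1)], axis=1)

df1 = datasX.iloc[:, :72]   # column positions 0..71
df2 = datasX.iloc[:, 72:]   # column positions 72..95

print(datasX.shape, df1.shape, df2.shape)
```

Because iloc works on positions, it does not matter that every concatenated column carries the same name.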


1 Comment

Note this comment. If you were trying to split on rows it would just be df[:72].
84

Use np.split(..., axis=1):

Demo:

In [255]: df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))

In [256]: df
Out[256]:
          a         b         c         d         e         f
0  0.823638  0.767999  0.460358  0.034578  0.592420  0.776803
1  0.344320  0.754412  0.274944  0.545039  0.031752  0.784564
2  0.238826  0.610893  0.861127  0.189441  0.294646  0.557034
3  0.478562  0.571750  0.116209  0.534039  0.869545  0.855520
4  0.130601  0.678583  0.157052  0.899672  0.093976  0.268974

In [257]: dfs = np.split(df, [4], axis=1)

In [258]: dfs[0]
Out[258]:
          a         b         c         d
0  0.823638  0.767999  0.460358  0.034578
1  0.344320  0.754412  0.274944  0.545039
2  0.238826  0.610893  0.861127  0.189441
3  0.478562  0.571750  0.116209  0.534039
4  0.130601  0.678583  0.157052  0.899672

In [259]: dfs[1]
Out[259]:
          e         f
0  0.592420  0.776803
1  0.031752  0.784564
2  0.294646  0.557034
3  0.869545  0.855520
4  0.093976  0.268974

np.split() is pretty flexible. Let's split the original DF into 3 DFs at column positions [2, 3]:

In [260]: dfs = np.split(df, [2,3], axis=1)

In [261]: dfs[0]
Out[261]:
          a         b
0  0.823638  0.767999
1  0.344320  0.754412
2  0.238826  0.610893
3  0.478562  0.571750
4  0.130601  0.678583

In [262]: dfs[1]
Out[262]:
          c
0  0.460358
1  0.274944
2  0.861127
3  0.116209
4  0.157052

In [263]: dfs[2]
Out[263]:
          d         e         f
0  0.034578  0.592420  0.776803
1  0.545039  0.031752  0.784564
2  0.189441  0.294646  0.557034
3  0.534039  0.869545  0.855520
4  0.899672  0.093976  0.268974
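
Calling np.split directly on a DataFrame goes through DataFrame.swapaxes, which pandas has deprecated. One way to keep the same split points without that call is to split an array of column positions and slice with iloc. A minimal sketch, where the helper name `split_columns` is hypothetical:

```python
import numpy as np
import pandas as pd

def split_columns(df, indices):
    """Split df into column blocks at the given positions (hypothetical helper)."""
    # np.split runs on a plain integer array here, never on the DataFrame itself.
    position_blocks = np.split(np.arange(df.shape[1]), indices)
    return [df.iloc[:, block] for block in position_blocks]

df = pd.DataFrame(np.random.rand(5, 6), columns=list("abcdef"))
dfs = split_columns(df, [2, 3])
print([list(part.columns) for part in dfs])
# [['a', 'b'], ['c'], ['d', 'e', 'f']]
```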

3 Comments

note np.split has been deprecated and will give a FutureWarning when it is used.
@SteveScott np.split doesn’t seem deprecated in Numpy 2.3, works pretty well without any warning.
Oh, however on Pandas DataFrames it uses a deprecated Pandas component (swapaxes) and no fix is planned, so it should be used on the index to ensure future compatibility, see stackoverflow.com/a/77858849/812102.
20

I generally use np.array_split because its syntax is simpler and it scales better to more than 2 partitions.

import numpy as np
partitions = 2
dfs = np.array_split(df, partitions)

np.split(df, [100, 200, 300], axis=0) wants explicit index numbers, which may or may not be desirable.

1 Comment

array_split is affected too :-( np.split is implemented on top of it.
0

A bit too long for a comment: note that np.split also accepts a number of sections instead of the indices, but it will raise an error if it cannot split into equal-length DataFrames:

>>> len(df)
20

>>> a = np.split(df, 4)
>>> [len(u) for u in a]
[5, 5, 5, 5]

>>> a = np.split(df, 3)
ValueError: array split does not result in an equal division

np.array_split does the same but produces near-equal chunks instead of raising an error:

>>> a = np.array_split(df, 4)
>>> [len(u) for u in a]
[5, 5, 5, 5]

>>> a = np.array_split(df, 3)
>>> [len(u) for u in a]
[7, 7, 6]

WARNING: as noted in the comments above, when applied to pandas DataFrames these functions call DataFrame.swapaxes, which is deprecated, and no fix is planned. Apply them to the DataFrame's index instead. A convenient helper:

def split(df, num_chunks):
    return [
        df.loc[chunk_idx]
        for chunk_idx in np.array_split(df.index, num_chunks)
    ]

>>> a = split(df, 4)
>>> [len(u) for u in a]
[5, 5, 5, 5]
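
The helper above selects rows by label, so if the index contains repeated labels the chunks can come back with more rows than intended. A positional variant sidesteps that. This is a minimal sketch; the name `split_rows` and the sample frame are made up for illustration:

```python
import numpy as np
import pandas as pd

def split_rows(df, num_chunks):
    """Split df into num_chunks row blocks by position (hypothetical helper)."""
    # Split row positions, not labels, so duplicate index values are harmless.
    position_blocks = np.array_split(np.arange(len(df)), num_chunks)
    return [df.iloc[block] for block in position_blocks]

# 20 rows whose index repeats every label twice: 0, 0, 1, 1, ...
df = pd.DataFrame({"x": range(20)}, index=[i // 2 for i in range(20)])
chunks = split_rows(df, 3)
print([len(chunk) for chunk in chunks])  # [7, 7, 6]
```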

0

Besides the deprecated DataFrame.swapaxes call under the hood of numpy.split, it helps to look at the source, which for a list of split indices is roughly equivalent to:

import numpy as np
def split_df(df, sections, axis=0):
    Nsections = len(sections) + 1
    div_points = [0] + list(sections) + [df.shape[axis]]
    sub_arrays = []
    sary = np.swapaxes(df, axis, 0) #throws warning of DataFrame.swapaxes deprecation
    for i in range(Nsections):
        st = div_points[i]
        end = div_points[i + 1]
        sub_arrays.append(np.swapaxes(sary[st:end], axis, 0))
    return sub_arrays

np.swapaxes appears to move the chosen axis to the front so that a single slice can be taken, then swaps it back. That can be replaced with positional slicing via df.iloc:

def split_df(df, sections, axis=0):
    Nsections = len(sections) + 1
    div_points = [0] + list(sections) + [df.shape[axis]]
    sub_arrays = []
    for i in range(Nsections):
        st = div_points[i]
        end = div_points[i + 1]
        if axis == 0:
            sub_arrays.append(df.iloc[st:end, :])
        elif axis == 1:
            sub_arrays.append(df.iloc[:, st:end])
    return sub_arrays

Or, a more Pythonic approach:

def split_df(df, sections, axis=0):
    div_points = [0] + list(sections) + [df.shape[axis]]
    if axis == 0:
        sub_arrays = [df.iloc[st:end, :] for st, end in zip(div_points[:-1], div_points[1:])]
    elif axis == 1:
        sub_arrays = [df.iloc[:, st:end] for st, end in zip(div_points[:-1], div_points[1:])]
    return sub_arrays

import pandas as pd

df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
split_df(df, [2, 3], axis=0)
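
For the column split asked about in the question, the same idea applies with axis=1. The sketch below restates the helper so that it runs on its own, and adds an explicit check on axis, which is my addition rather than part of the code above. The 4x96 sample frame is made up.

```python
import numpy as np
import pandas as pd

def split_df(df, sections, axis=0):
    """Split df at the given positions along rows (axis=0) or columns (axis=1)."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    div_points = [0] + list(sections) + [df.shape[axis]]
    pairs = zip(div_points[:-1], div_points[1:])
    if axis == 0:
        return [df.iloc[st:end, :] for st, end in pairs]
    return [df.iloc[:, st:end] for st, end in pairs]

wide = pd.DataFrame(np.random.rand(4, 96))
left, right = split_df(wide, [72], axis=1)
print(left.shape, right.shape)  # (4, 72) (4, 24)
```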

0

If you just want to split at a column position, use iloc with a split index (also mentioned in the first answer) and call .copy() so that later edits to the pieces do not trigger SettingWithCopyWarning:

k = 72
df_left  = datasX.iloc[:, :k].copy() # for the first 72 columns
df_right = datasX.iloc[:, k:].copy() # for the remaining 24 columns

If you prefer to define the split by the last 24 columns, compute k from the shape:

k = datasX.shape[1] - 24
df_left  = datasX.iloc[:, :k].copy()
df_right = datasX.iloc[:, k:].copy()

You could also wrap this in a small helper to make it reusable, something like:

def split_at(df, k):
    """Return two DataFrames split at column index k."""
    return df.iloc[:, :k].copy(), df.iloc[:, k:].copy()

df1, df2 = split_at(datasX, 72)

Both outputs keep the original index and their original column names.
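
To illustrate what the .copy() calls buy you, here is a minimal sketch with a made-up 3x6 frame: writing to one of the pieces leaves the original untouched.

```python
import numpy as np
import pandas as pd

def split_at(df, k):
    """Return two independent DataFrames split at column position k."""
    return df.iloc[:, :k].copy(), df.iloc[:, k:].copy()

original = pd.DataFrame(np.zeros((3, 6)), columns=list("abcdef"))
left, right = split_at(original, 4)

left.loc[:, "a"] = 1.0          # modify the copy only
print(original["a"].tolist())   # [0.0, 0.0, 0.0]
print(list(left.columns), list(right.columns))
```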
