I have a pandas dataframe that effectively contains several different datasets. Between each dataset is a row full of NaN. Can I split the dataframe on the NaN row to make two dataframes? Thanks in advance.
3 Answers
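You can use this to split into many data frames based on all-NaN rows. It assumes the default 0..n-1 integer index, because the index labels of the NaN rows are passed to iloc as positions: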
#index of all NaN rows (+ beginning and end of df)
idx = [0] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
#list of data frames split at all NaN indices
list_of_dfs = [df.iloc[idx[n]:idx[n+1]] for n in range(len(idx)-1)]
And if you want to exclude the NaN rows from split data frames:
idx = [-1] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
list_of_dfs = [df.iloc[idx[n]+1:idx[n+1]] for n in range(len(idx)-1)]
Example:
df:
0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN
3 NaN NaN
4 NaN NaN
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN
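list_of_dfs (from the version that excludes the NaN rows; the empty frame comes from the two consecutive NaN rows):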
[ 0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN,
Empty DataFrame
Columns: [0, 1]
Index: [],
0 1
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN]
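If the frame does not have the default integer index, the label-based lookup above no longer lines up with iloc. A minimal sketch of the same idea using positions instead of labels follows; the helper name split_on_nan_rows and its keep_empty flag are made up for illustration and are not part of pandas:

```python
import numpy as np
import pandas as pd


def split_on_nan_rows(df, keep_empty=False):
    """Split df at rows where every value is NaN; separator rows are dropped."""
    # Positional (0-based) locations of the all-NaN rows, independent of index labels.
    sep = np.flatnonzero(df.isna().all(axis=1).to_numpy())
    # Sentinels: one position before the first row, one past the last row.
    bounds = [-1, *sep.tolist(), len(df)]
    chunks = [df.iloc[start + 1:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    # Consecutive NaN rows produce empty chunks; drop them unless asked to keep them.
    return chunks if keep_empty else [c for c in chunks if not c.empty]


# Same values as the example above, but with string labels instead of 0..9.
df = pd.DataFrame(
    {
        0: [1, np.nan, 1, np.nan, np.nan, 1, 1, np.nan, 1, 1],
        1: [1, 1, np.nan, np.nan, np.nan, 1, 1, 1, np.nan, np.nan],
    },
    index=list("abcdefghij"),
)

parts = split_on_nan_rows(df)
```

Here parts holds two frames (rows a to c and rows f to j); passing keep_empty=True also returns the empty frame between the two adjacent NaN rows, as in the output above.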
My solution splits your DataFrame into any number of chunks, one split at each row full of NaNs.
Assume that the input DataFrame contains:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
3 NaN NaN NaN
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
7 NaN NaN NaN
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
so that "split points" are rows with indices 3 and 7.
To do your task:
Generate the grouping criterion Series:
grp = (df.isnull().sum(axis=1) == df.shape[1]).cumsum()
Drop rows full of NaN and group the result by the above criterion:
gr = df.dropna(axis=0, thresh=1).groupby(grp)
thresh=1 means that for the current row it is enough to have 1 non-NaN value to be kept in the result.
Perform actual split, as a list comprehension:
result = [ gr.get_group(key) for key in gr.groups ]
To print the result, you can run:
for i, chunk in enumerate(result):
    print(f'Chunk {i}:')
    print(chunk, end='\n\n')
getting:
Chunk 0:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
Chunk 1:
A B C
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
Chunk 2:
A B C
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
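For anyone who wants to run this end to end, here is a self-contained sketch of the same grouping idea on the sample data. It is a variant, not the exact code above: it builds the separator mask with isna().all(axis=1) and iterates over the groupby instead of calling get_group, and the name is_sep is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# The sample frame from this answer; rows 3 and 7 are entirely NaN.
df = pd.DataFrame({
    "A": [10, 11, 12, np.nan, np.nan, 21, 22, np.nan, 30, np.nan, 32],
    "B": ["Abc", np.nan, "Ghi", np.nan, "Hkx", "Jkl", "Mno", np.nan, "Pqr", "Stu", "Vwx"],
    "C": [20, 21, np.nan, np.nan, 30, 32, 33, np.nan, 40, np.nan, 44],
})

is_sep = df.isna().all(axis=1)  # True only on the separator rows
grp = is_sep.cumsum()           # chunk number for every row: 0,0,0,1,1,1,1,2,2,2,2

# Drop the separator rows, then collect one sub-frame per chunk number.
result = [chunk for _, chunk in df[~is_sep].groupby(grp[~is_sep])]
```

The original row labels are kept in each chunk; call reset_index(drop=True) on a chunk if you want it renumbered from 0.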
df.groupby(df.isnull().any(axis=1).cumsum())
Comments
Missing parentheses: isnull(). I think OP also meant all instead of any.
Fixed isnull(). From the OP's description it sounds like either might work, but all would be the safer bet.
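To make the any versus all point concrete, here is a small sketch on a made-up four-row frame (the data and variable names are illustrative only). With any, a row that has just one missing value also starts a new group; with all, only a fully empty row does.

```python
import numpy as np
import pandas as pd

# Row 1 has one NaN, row 2 is entirely NaN.
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [1.0, 2.0, np.nan, 4.0],
})

any_mask = df.isnull().any(axis=1)  # True where at least one value is NaN
all_mask = df.isnull().all(axis=1)  # True where every value is NaN

# Each True in the mask bumps the cumulative sum and so starts a new group.
n_groups_any = df.groupby(any_mask.cumsum()).ngroups
n_groups_all = df.groupby(all_mask.cumsum()).ngroups
```

Note that grouping on the cumulative sum alone leaves each separator row as the first row of the group it starts, so filter those rows out first if you do not want them in the pieces.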