3

I have a pandas DataFrame that effectively contains several different datasets, with a row full of NaN between consecutive datasets. Can I split the DataFrame on the NaN rows to get a separate DataFrame for each dataset? Thanks in advance.

3
  • df.groupby(df.isnull.any(axis=1).cumsum()) Commented Jul 13, 2020 at 4:16
  • @PaulH typo? isnull(). I think OP also meant all instead of any. Commented Jul 13, 2020 at 4:30
  • @Ehsan yeah, isnull(). From the OP's description it sounds like either might work, but all would be the safer bet. Commented Jul 13, 2020 at 4:35

3 Answers

2

You can use this to split the data frame into several data frames at each all-NaN row:

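# Index of all-NaN rows, plus the beginning and end of df.
# Note: these index labels are passed to iloc, so this assumes the default RangeIndex.
idx = [0] + df.index[df.isnull().all(axis=1)].tolist() + [df.shape[0]]
# List of data frames split at the all-NaN rows (each NaN row starts the next chunk)
list_of_dfs = [df.iloc[idx[n]:idx[n+1]] for n in range(len(idx)-1)]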

And if you want to exclude the NaN rows from split data frames:

idx = [-1] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
list_of_dfs = [df.iloc[idx[n]+1:idx[n+1]] for n in range(len(idx)-1)]

Example:

df:

     0    1
0  1.0  1.0
1  NaN  1.0
2  1.0  NaN
3  NaN  NaN
4  NaN  NaN
5  1.0  1.0
6  1.0  1.0
7  NaN  1.0
8  1.0  NaN
9  1.0  NaN

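list_of_dfs (from the second variant, which excludes the NaN rows; the two consecutive NaN rows produce an empty data frame):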

[     0    1
0  1.0  1.0
1  NaN  1.0
2  1.0  NaN, 

Empty DataFrame
Columns: [0, 1]
Index: [],   

     0    1
5  1.0  1.0
6  1.0  1.0
7  NaN  1.0
8  1.0  NaN
9  1.0  NaN]
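
As a hedged variation on the above (not the original answer's code), the sketch below uses integer positions from `np.flatnonzero` instead of index labels, so it should also work when the index is not the default RangeIndex, and it skips the empty chunks that back-to-back NaN rows would otherwise produce. The sample frame is rebuilt from the example; the names `nan_pos` and `bounds` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame matching the example above: rows 3 and 4 are entirely NaN.
nan = np.nan
df = pd.DataFrame({
    0: [1, nan, 1, nan, nan, 1, 1, nan, 1, 1],
    1: [1, 1, nan, nan, nan, 1, 1, 1, nan, nan],
})

# Integer positions (not labels) of the all-NaN rows, so iloc is always valid.
nan_pos = np.flatnonzero(df.isnull().all(axis=1).to_numpy())

# Sentinels: -1 before the first row, len(df) after the last one.
bounds = [-1, *nan_pos.tolist(), len(df)]

# Slice between consecutive separators, leaving out the separator rows
# and skipping the empty chunks caused by adjacent NaN rows.
list_of_dfs = [
    df.iloc[start + 1:stop]
    for start, stop in zip(bounds[:-1], bounds[1:])
    if stop - start > 1
]

for chunk in list_of_dfs:
    print(chunk, end="\n\n")
```

With the example data this should leave two chunks, one with rows 0 to 2 and one with rows 5 to 9.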

0

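Use df[df[COLUMN_NAME].isnull()].index.tolist() to get a list of the index labels of the rows where COLUMN_NAME is NaN. Note that this checks a single column only, so pick one that is NaN only in the separator rows. You can then split the dataframe into multiple dataframes at those indices.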
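
A minimal sketch of how that split could look, under the assumption that the chosen column is NaN only in the separator rows. The data, the column names and the helper variables (`nan_labels`, `nan_pos`, `bounds`, `chunks`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "value" is NaN only on the separator rows (2 and 6).
df = pd.DataFrame({
    "value": [1.0, 2.0, np.nan, 3.0, 4.0, 5.0, np.nan, 6.0],
    "label": ["a", "b", None, "c", "d", "e", None, "f"],
})

COLUMN_NAME = "value"  # assumption: NaN in this column marks a separator row

# Index labels of the rows where the chosen column is NaN.
nan_labels = df[df[COLUMN_NAME].isnull()].index.tolist()

# Convert labels to integer positions so iloc slicing is safe
# (get_indexer requires a unique index).
nan_pos = df.index.get_indexer(nan_labels).tolist()

# Slice between consecutive separators, leaving the separators out.
bounds = [-1] + nan_pos + [len(df)]
chunks = [
    df.iloc[start + 1:stop]
    for start, stop in zip(bounds[:-1], bounds[1:])
    if stop - start > 1
]
```

Here `chunks` should hold three dataframes, with 2, 3 and 1 rows respectively.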

0

My solution splits your DataFrame into any number of chunks, at each row full of NaNs.

Assume that the input DataFrame contains:

       A    B     C
0   10.0  Abc  20.0
1   11.0  NaN  21.0
2   12.0  Ghi   NaN
3    NaN  NaN   NaN
4    NaN  Hkx  30.0
5   21.0  Jkl  32.0
6   22.0  Mno  33.0
7    NaN  NaN   NaN
8   30.0  Pqr  40.0
9    NaN  Stu   NaN
10  32.0  Vwx  44.0

so that the "split points" are the rows with indices 3 and 7.

To perform the split:

  1. Generate the grouping criterion Series:

     grp = (df.isnull().sum(axis=1) == df.shape[1]).cumsum()
    
  2. Drop rows full of NaN and group the result by the above criterion:

     gr = df.dropna(axis=0, thresh=1).groupby(grp)
    

    thresh=1 means that a row is kept in the result if it has at least 1 non-NaN value.

  3. Perform the actual split, as a list comprehension:

     result = [ gr.get_group(key) for key in gr.groups ]
    

To print the result, you can run:

for i, chunk in enumerate(result):
    print(f'Chunk {i}:')
    print(chunk, end='\n\n')

getting:

Chunk 0:
      A    B     C
0  10.0  Abc  20.0
1  11.0  NaN  21.0
2  12.0  Ghi   NaN

Chunk 1:
      A    B     C
4   NaN  Hkx  30.0
5  21.0  Jkl  32.0
6  22.0  Mno  33.0

Chunk 2:
       A    B     C
8   30.0  Pqr  40.0
9    NaN  Stu   NaN
10  32.0  Vwx  44.0
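
For convenience, the three steps can be wrapped in one function. The following is only a sketch: the function name `split_on_nan_rows` is hypothetical, and it uses `isnull().all(axis=1)` as an equivalent way to detect rows full of NaN instead of comparing the NaN count with `df.shape[1]`. The input frame is rebuilt from the sample data above.

```python
import numpy as np
import pandas as pd

def split_on_nan_rows(df):
    """Return a list of sub-frames, split at rows in which every value is NaN."""
    is_sep = df.isnull().all(axis=1)   # True on the all-NaN separator rows
    grp = is_sep.cumsum()              # group number grows by 1 at each separator
    # Drop the separators, then collect the groups in order.
    return [chunk for _, chunk in df[~is_sep].groupby(grp[~is_sep])]

nan = np.nan
df = pd.DataFrame({
    "A": [10, 11, 12, nan, nan, 21, 22, nan, 30, nan, 32],
    "B": ["Abc", nan, "Ghi", nan, "Hkx", "Jkl", "Mno", nan, "Pqr", "Stu", "Vwx"],
    "C": [20, 21, nan, nan, 30, 32, 33, nan, 40, nan, 44],
})

result = split_on_nan_rows(df)
for i, chunk in enumerate(result):
    print(f"Chunk {i}:")
    print(chunk, end="\n\n")
```

Iterating over the groupby object directly avoids the separate get_group calls; the chunks should be the same three as printed above.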
