3

To prep my data correctly for an ML task, I need to be able to split my original dataframe into multiple smaller dataframes. I want to get all the rows above and including the row where the value for column 'BOOL' is 1, for every occurrence of 1, i.e. n dataframes where n is the number of occurrences of 1.

A sample of the data:

import pandas as pd

df = pd.DataFrame({"USER_ID": ['001', '001', '001', '001', '001'],
                   "VALUE": [1, 2, 3, 4, 5],
                   "BOOL": [0, 1, 0, 1, 0]})

Expected output is 2 dataframes: rows 0 to 1 of df, and rows 0 to 3 of df (both shown as images in the original post).

I have considered a for loop using if-else statements to append rows, but it is highly inefficient for the dataset I am using. I am looking for a more Pythonic way of doing this.

3 Answers

5

You can use np.split which accepts an array of indices where to split:

np.split(df, *np.where(df.BOOL == 1))

If you want to include the rows with BOOL == 1 in the preceding data frame, you can just add 1 to all the indices:

np.split(df, np.where(df.BOOL == 1)[0] + 1)
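
In case np.split does not accept a data frame in your pandas version (newer releases may warn about it), here is a minimal sketch of the same consecutive split using plain positional slicing. It assumes the sample data from the question; the names cuts, bounds and chunks are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"USER_ID": ['001'] * 5,
                   "VALUE": [1, 2, 3, 4, 5],
                   "BOOL": [0, 1, 0, 1, 0]})

# Positions just past each BOOL == 1 row, so that row ends its chunk.
cuts = np.flatnonzero(df["BOOL"].to_numpy() == 1) + 1

# Chunk boundaries: start of the frame, every cut, end of the frame.
bounds = [0, *cuts, len(df)]

# Consecutive, non-overlapping pieces, sliced by position.
chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

print([len(c) for c in chunks])  # [2, 2, 1]
```

Like np.split, this yields an empty trailing data frame when the last row has BOOL == 1.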

8 Comments

Works like a charm, but how do I access each of the resulting dataframes?
@Ash What do you mean by "access"? The function returns a list that contains all the data frames so you can access that list. Note that the indices are retained within each of the sub-data frames.
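
For instance, a small sketch of accessing one piece and renumbering its index. The list pieces is a hypothetical stand-in for the split result, built here by positional slicing on the question's sample data.

```python
import pandas as pd

df = pd.DataFrame({"USER_ID": ['001'] * 5,
                   "VALUE": [1, 2, 3, 4, 5],
                   "BOOL": [0, 1, 0, 1, 0]})

# Hypothetical stand-in for the list returned by the split.
pieces = [df.iloc[0:2], df.iloc[2:4], df.iloc[4:5]]

second = pieces[1]                          # ordinary list indexing
print(second.index.tolist())                # [2, 3] (original index retained)

renumbered = second.reset_index(drop=True)  # start the index again from 0
print(renumbered.index.tolist())            # [0, 1]
```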
np.split(df, np.where(df.BOOL == 1)[0] + 1) does not work. Also, you split the dataframe into 3; I think he needs rows 0 to n for each piece (where n is the index of a BOOL == 1 row).
@Wen-Ben Why not? It does work for the given example and it won't raise an error even if the indices run out of range; in that case you just get empty data frames.
@a_guest I think his expected output needs two dataframes (rows 0-1 and rows 0-3), and you return 3, with lengths 2, 2 and 1. Am I right?
3

I think using a for loop is better here:

idx = df.BOOL.to_numpy().nonzero()[0]  # positions of the rows where BOOL is non-zero

d = {x: df.iloc[:y + 1, :] for x, y in enumerate(idx)}  # one prefix of df per position
d[0]
   BOOL USER_ID  VALUE
0     0     001      1
1     1     001      2

5 Comments

Really good approach - which works on the sample dataset. But for some cryptic reason does not work on my actual dataframe. It returns n dataframes - all of the original size.
@Ash Anyway, I just followed your expected output (the two pictures above).
@Wen-Ben You mix index and iloc; that's probably the reason why it doesn't work for the other data frame (in case the indices there are not a simple enumeration).
@Wen-Ben But now for non-numeric indices the +1 will fail. So you should probably stick to iloc and use the positions of the index.
@a_guest check nonzero
2

Why not a list comprehension? Like:

>>> l=[df.iloc[:i+1] for i in df.index[df['BOOL']==1]]
>>> l[0]
   BOOL USER_ID  VALUE
0     0     001      1
1     1     001      2
>>> l[1]
   BOOL USER_ID  VALUE
0     0     001      1
1     1     001      2
2     0     001      3
3     1     001      4
>>> 

4 Comments

Simplifies @Wen-Ben's approach to 1 line - but I still have the same issue. Works on the sample dataset. But not on my actual dataframe. This returns n dataframes - all of the original size.
You mix index and iloc; that's probably the reason why it doesn't work for the other data frame (in case the indices there are not a simple enumeration).
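
One way the "all of the original size" result could arise is sketched below. It assumes the real frame has an integer index that does not start at 0; that is a guess, since the actual data is not shown, and the index range(100, 105) is purely hypothetical.

```python
import pandas as pd

# Sample data with a hypothetical index that starts at 100 instead of 0.
df = pd.DataFrame({"USER_ID": ['001'] * 5,
                   "VALUE": [1, 2, 3, 4, 5],
                   "BOOL": [0, 1, 0, 1, 0]},
                  index=range(100, 105))

# Index labels (101, 103) passed to iloc: every slice runs past the end,
# so each result is the whole frame.
by_label = [df.iloc[:i + 1] for i in df.index[df["BOOL"] == 1]]

# Row positions (1, 3) passed to iloc: the intended prefixes.
mask = (df["BOOL"] == 1).to_numpy()
by_position = [df.iloc[:p + 1] for p in mask.nonzero()[0]]

print([len(x) for x in by_label])     # [5, 5]
print([len(x) for x in by_position])  # [2, 4]
```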
@a_guest Sorry mate, can you explain what you mean? Not quite sure I understand what you mean by mix index and iloc?
@Ash iloc selects by integer position while loc selects by index label. So for your example there's no difference, as your index is [0, 1, 2, 3, 4] and the labels match their positions. However, if you used for example ['a', 'b', 'c', 'd', 'e'] as index, then df.index[df.BOOL == 1] would return ['b', 'd'] while iloc expects the corresponding positions, i.e. [1, 3]. loc, on the other hand, does expect labels, but then you can't do the increment i + 1. So in that case you should stick to index_position = df.BOOL.to_numpy().nonzero()[0] + 1 and use it together with df.iloc[:i].
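
A short sketch of that difference, assuming the sample data with a hypothetical letter index. It also shows that loc label slices include their end label, so no increment is needed there, provided the labels are unique.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"USER_ID": ['001'] * 5,
                   "VALUE": [1, 2, 3, 4, 5],
                   "BOOL": [0, 1, 0, 1, 0]},
                  index=list("abcde"))  # hypothetical non-numeric index

labels = df.index[df["BOOL"] == 1]                      # 'b' and 'd'
positions = np.flatnonzero(df["BOOL"].to_numpy() == 1)  # 1 and 3

# iloc works on positions; the slice end is exclusive, hence the + 1.
prefixes = [df.iloc[:p + 1] for p in positions]

# loc works on labels; the slice end is inclusive (assumes unique labels).
prefixes_by_label = [df.loc[:label] for label in labels]

print([len(x) for x in prefixes])           # [2, 4]
print([len(x) for x in prefixes_by_label])  # [2, 4]
```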
