
I have a pandas dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['Text','Selection_Values'])
df["Text"] = ["Hi", "this is", "just", "a", "single", "sentence.", "This", np.nan, "is another one.","This is", "a", "third", "sentence","."]
df["Selection_Values"] = [0,0,0,0,0,1,0,0,1,0,0,0,0,0]
print(df)

Output:

               Text  Selection_Values
0                Hi                 0
1           this is                 0
2              just                 0
3                 a                 0
4            single                 0
5         sentence.                 1
6              This                 0
7               NaN                 0
8   is another one.                 1
9           This is                 0
10                a                 0
11            third                 0
12         sentence                 0
13                .                 0

Now I want to regroup the Text column into a 2D list based on the Selection_Values column. All cells between a 0 (the first row, or the row after a 1) and the next 1 (inclusive) should be put into one inner list. The last sentence of the dataset might have no closing 1. This can be done as explained in this question: Regroup pandas column into 2D list based on another column

[["Hi this is just a single sentence."], ["This is another one."], ["This is a third sentence ."]]

I would like to go a step further and add the following condition: if a list contains more than max_number_of_cells_per_list non-NaN cells, it should be divided into roughly equal parts, each containing max_number_of_cells_per_list cells, give or take one.

Let's say: max_number_of_cells_per_list = 2, then the expected output should be:

[["Hi this is"], ["just a"], ["single sentence."], ["This is another one."], ["This is"], ["a third sentence ."]]

Example:

Based on the column 'Selection_Values' one can regroup the cells into the following 2D list, using:

[[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]

Output (original list):

[["Hi this is just a single sentence."], ["This is another one."], ["This is a third sentence ."]]
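For reference, the same regrouping can also be written without np.split. This is only a sketch: the name sentence_id is made up for illustration, and it assumes that a 1 in Selection_Values always marks the last cell of a sentence.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Sentence number per row: count the closing 1s that occur strictly before the row.
sentence_id = df["Selection_Values"].shift(fill_value=0).cumsum()

# Drop NaN cells, then join what is left of each sentence with spaces.
text = df["Text"].dropna()
sentences = [[" ".join(cells)]
             for _, cells in text.groupby(sentence_id.loc[text.index])]
print(sentences)
```

Because the sentence number is computed on the full frame before the NaN rows are dropped, a NaN cell cannot move a sentence boundary.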

Let's have a look at the number of cells that are within those lists:

(image not included)

As you can see, list 1 has 6 cells, list 2 has 2 cells (not counting the NaN), and list 3 has 5 cells.

Now, what I would like to achieve is the following: if a list has more than a certain number of cells, it should be split up such that each resulting list is within ±1 of the desired number of cells.

So for example max_number_of_cells_per_list = 2

Modified list: (image not included)

Do you see a way of doing this?

EDIT: Important note: Cells from different original lists must not be put into the same list.

EDIT 2:

               Text  Selection_Values  New
0                Hi                 0  1.0
1           this is                 0  0.0
2              just                 0  1.0
3                 a                 0  0.0
4            single                 0  1.0
5         sentence.                 1  0.0
6              This                 0  1.0
7               NaN                 0  0.0
8   is another one.                 1  1.0
9           This is                 0  0.0
10                a                 0  1.0
11            third                 0  0.0
12         sentence                 0  0.0
13                .                 0  NaN
  • can we define max_number_of_cells_per_list before this operation? Commented Jul 21, 2019 at 11:40
  • @anky_91, yes, you can... but you cannot put two cells from different original lists together. So, for instance, you cannot put the "This" from list 2 into list 1. Commented Jul 21, 2019 at 11:40

1 Answer

IIUC, you can do something like:

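n = 2  # number of cells per chunk (max_number_of_cells_per_list)
s = df.Text.dropna().reset_index(drop=True)  # non-NaN cells, renumbered from 0
# True at the first cell of each block of n cells; shift().shift(-1) blanks the
# final flag, so a block that starts at the very last cell joins the previous one
c = s.groupby(s.index // n).cumcount().eq(0).shift().shift(-1).fillna(False)

# join the cells of each block into one string
[[i] for i in s.groupby(c.cumsum()).apply(' '.join).tolist()]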

[['Hi this is'], ['just a'], ['single sentence.'], 
    ['This is another one.'], ['This is a'], ['third sentence .']]
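Note that the snippet above cuts the NaN-free column into fixed blocks of n cells and never reads Selection_Values. It keeps the sentences apart here only because the first two sentences contain an even number of non-NaN cells (6 and 2). A possible variant that splits each sentence separately is sketched below. The helper split_evenly is hypothetical, and the rule it applies (use ceil(m / n) chunks for a sentence of m cells, with sizes differing by at most one) is just one reading of "roughly equal parts"; it does not reproduce the question's hand-written expected output for the third sentence.

```python
import math

import numpy as np
import pandas as pd


def split_evenly(cells, max_cells):
    """Split a list into ceil(len/max_cells) chunks whose sizes differ by at most 1."""
    if not cells:
        return []
    k = math.ceil(len(cells) / max_cells)
    base, extra = divmod(len(cells), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more cell
        chunks.append(cells[start:start + size])
        start += size
    return chunks


df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})
n = 2

# Sentence number per row, computed before NaN rows are dropped.
sentence_id = df["Selection_Values"].shift(fill_value=0).cumsum()
text = df["Text"].dropna()

result = []
for _, cells in text.groupby(sentence_id.loc[text.index]):
    for chunk in split_evenly(cells.tolist(), n):
        result.append([" ".join(chunk)])
print(result)
```

With n = 2 this gives [['Hi this is'], ['just a'], ['single sentence.'], ['This is another one.'], ['This is a'], ['third sentence'], ['.']]. No chunk exceeds n cells, and no chunk mixes cells from two sentences.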

EDIT:

d=dict(zip(df.loc[df.Text.notna(),'Text'].index,c.index))
ser=pd.Series(d)
df['new']=ser.reindex(range(ser.index.min(),
                        ser.index.max()+1)).map(c).fillna(False).astype(int)
print(df)

               Text  Selection_Values  new
0                Hi                 0    1
1           this is                 0    0
2              just                 0    1
3                 a                 0    0
4            single                 0    1
5         sentence.                 1    0
6              This                 0    1
7               NaN                 0    0
8   is another one.                 1    0
9           This is                 0    1
10                a                 0    0
11            third                 0    1
12         sentence                 0    0
13                .                 0    0

4 Comments

Question: Is it possible to generate a list like 'Selection_Values' for the final selection, to insert as a new column into the dataset?
@henry you mean the c variable, i.e. s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False).astype(int)? A 1 marks where a chunk starts; the chunk runs until the next 1.
Thanks, I added this as a New column. As you can see in EDIT 2, the 1s in the New column and in Selection_Values do not match.
This is probably due to the NaN value, which is not correctly matched.
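One way to avoid the NaN misalignment discussed in these comments is to compute the flags directly on the original index instead of on a re-indexed copy. The sketch below is an assumption-laden illustration, not the answer's method: it restarts the chunking at every sentence and uses ceil(m / n) chunks for a sentence of m non-NaN cells. The names sentence_id and start_flags are made up.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})
n = 2

# Sentence number per row, computed before NaN rows are dropped.
sentence_id = df["Selection_Values"].shift(fill_value=0).cumsum()
text = df["Text"].dropna()  # keeps the original row labels

# 1 marks the first cell of a chunk; NaN rows and all other rows stay 0.
start_flags = pd.Series(0, index=df.index)
for _, cells in text.groupby(sentence_id.loc[text.index]):
    m = len(cells)
    k = math.ceil(m / n)          # number of chunks for this sentence
    base, extra = divmod(m, k)    # the first `extra` chunks get one more cell
    pos = 0
    for i in range(k):
        start_flags.loc[cells.index[pos]] = 1
        pos += base + (1 if i < extra else 0)

df["new"] = start_flags
print(df)
```

Since the row labels of the NaN-free series are never reset, each flag lands on the row it was computed for.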
