Structuring a 2D array from a pandas dataframe

Question

I have a pandas dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['Text','Selection_Values'])
df["Text"] = ["Hi", "this is", "just", "a", "single", "sentence.", "This", np.nan, "is another one.","This is", "a", "third", "sentence","."]
df["Selection_Values"] = [0,0,0,0,0,1,0,0,1,0,0,0,0,0]
print(df)

Output:

               Text  Selection_Values
0                Hi                 0
1           this is                 0
2              just                 0
3                 a                 0
4            single                 0
5         sentence.                 1
6              This                 0
7               NaN                 0
8   is another one.                 1
9           This is                 0
10                a                 0
11            third                 0
12         sentence                 0
13                .                 0

Now, I want to regroup the Text column into a 2D array based on the Selection_Values column. All cells that appear between a 0 (the first integer, or the one after a 1) and a 1 (inclusive) should be put into one list. The last sentence of the dataset might have no closing 1. This can be done as explained in this question: Regroup pandas column into 2D list based on another column

[["Hi this is just a single sentence."], ["This is another one."], ["This is a third sentence ."]]

I would like to go a step further and add the following condition: if a list contains more than max_number_of_cells_per_list non-NaN cells, it should be divided into roughly equal parts, each holding max_number_of_cells_per_list cells, give or take one.

Let's say max_number_of_cells_per_list = 2; then the expected output should be:

[["Hi this is"], ["just a"], ["single sentence."], ["This is another one."], ["This is"], ["a third sentence ."]]

Example:

Based on the column 'Selection_Values' one can regroup the cells into the following 2D list, using:

[[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]

Output (original list):

[["Hi this is just a single sentence."], ["This is another one."], ["This is a third sentence ."]]

Let's have a look at the number of cells that are within those lists:

List 1 has 6 cells, list 2 has 2 cells, and list 3 has 5 cells (NaN cells are not counted).
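The cell counts quoted above can be checked with a short sketch. It assumes that a new list starts on the row after each 1 in Selection_Values; `sentence_id` and `cells_per_list` are helper names introduced here and are not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Assumed rule: a row starts a new list when the previous row carried a 1.
sentence_id = df["Selection_Values"].shift(fill_value=0).cumsum()

# count() skips NaN, so this is the number of non-NaN cells in each list.
cells_per_list = df["Text"].groupby(sentence_id).count().tolist()
print(cells_per_list)  # [6, 2, 5]
```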

Now, what I would like to achieve is the following: if a list contains more than a certain number of cells, it should be split up such that each resulting list has the wanted number of cells, give or take one.

So for example, with max_number_of_cells_per_list = 2

Modified list:

[["Hi this is"], ["just a"], ["single sentence."], ["This is another one."], ["This is"], ["a third sentence ."]]

Do you see a way of doing this?

EDIT: Important note: Cells from different original lists must not be put into the same list.

EDIT 2:

               Text  Selection_Values  New
0                Hi                 0  1.0
1           this is                 0  0.0
2              just                 0  1.0
3                 a                 0  0.0
4            single                 0  1.0
5         sentence.                 1  0.0
6              This                 0  1.0
7               NaN                 0  0.0
8   is another one.                 1  1.0
9           This is                 0  0.0
10                a                 0  1.0
11            third                 0  0.0
12         sentence                 0  0.0
13                .                 0  NaN

can we define max_number_of_cells_per_list before this operation?
– anky, Commented Jul 21, 2019 at 11:40
@anky_91, yes, you can, but you cannot put two cells from different original lists together. So for instance, you cannot put the This from list 2 into list 1.
– henry, Commented Jul 21, 2019 at 11:40

anky · Accepted Answer · 2019-07-21 18:24:37Z

5

IIUC, you can do something like:

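n=2 # number of non-NaN cells per chunk; change this as you like
# drop the NaN cells and renumber the remaining ones from 0
s=df.Text.dropna().reset_index(drop=True)
# True at the first cell of every block of n; the shift round trip blanks the
# last flag, so fillna(False) attaches the final cell to the previous chunk
c=s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False)

# every True opens a new chunk; join the cells of each chunk with spaces
[[i] for i in s.groupby(c.cumsum()).apply(' '.join).tolist()]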

[['Hi this is'], ['just a'], ['single sentence.'], 
    ['This is another one.'], ['This is a'], ['third sentence .']]
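Note that the snippet above chunks the whole column in blocks of n and never looks at Selection_Values. It lines up with the sentence boundaries here only because the first two lists happen to hold an even number of cells. A possible variant that splits each original list separately is sketched below. It is not part of the original answer: `split_groups` and `sentence_id` are hypothetical names, and rounding the number of parts up (so that no part exceeds the maximum) is an assumption about what "roughly equal" should mean.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})


def split_groups(frame, max_cells):
    """Split every original list into near-equal parts of at most max_cells cells."""
    # Assumed rule: a row starts a new list when the previous row carried a 1.
    sentence_id = frame["Selection_Values"].shift(fill_value=0).cumsum()
    keep = frame["Text"].notna()  # NaN cells are ignored entirely
    result = []
    for _, cells in frame["Text"][keep].groupby(sentence_id[keep]):
        words = cells.tolist()
        # Rounding up keeps every part at or below max_cells (assumption).
        n_parts = math.ceil(len(words) / max_cells)
        # array_split tolerates uneven division; part sizes differ by at most 1.
        for part in np.array_split(words, n_parts):
            result.append([" ".join(part)])
    return result


print(split_groups(df, 2))
```

With max_cells = 2 the five cells of the third list end up in parts of 2, 2 and 1 cells, so the trailing "." stays on its own instead of being attached to the previous part. Rounding down instead of up would give parts of 3 and 2 cells; which of the two is preferable depends on how strictly the maximum should be enforced.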

EDIT:

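# map the original row label of every non-NaN cell to its position in c
d=dict(zip(df.loc[df.Text.notna(),'Text'].index,c.index))
ser=pd.Series(d)
# look up the flag for each row; the NaN row has no entry and becomes 0
df['new']=ser.reindex(range(ser.index.min(),
                        ser.index.max()+1)).map(c).fillna(False).astype(int)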
print(df)

               Text  Selection_Values  new
0                Hi                 0    1
1           this is                 0    0
2              just                 0    1
3                 a                 0    0
4            single                 0    1
5         sentence.                 1    0
6              This                 0    1
7               NaN                 0    0
8   is another one.                 1    0
9           This is                 0    1
10                a                 0    0
11            third                 0    1
12         sentence                 0    0
13                .                 0    0

edited Jul 21, 2019 at 18:24

answered Jul 21, 2019 at 12:40


4 Comments

henry Over a year ago

Question: Is it possible to generate a list like 'Selection_Values' for the final selection, to insert as a new column into the dataset?

anky Over a year ago

@henry you mean the c variable, i.e. s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False).astype(int)? A 1 marks where a chunk starts, and the chunk runs until the next 1.

henry Over a year ago

Thanks, I added this as a New column. As you can see in EDIT 2, the 1s from the New column and the Selection_Values do not match.

henry Over a year ago

This is probably due to the NaN value, which is not correctly matched.
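The mismatch described in this comment can be reproduced with a small sketch. It assumes the flags were computed on the NaN-free, renumbered series and then assigned straight to the dataframe; pandas aligns such an assignment by index label, not by position. The names `starts`, `misaligned` and `aligned` are introduced here for illustration, and the flags are a simplified version of the answer's c (the last cell is not attached to the previous chunk).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})
n = 2  # cells per chunk, as in the answer

text = df["Text"].dropna()  # keeps the original labels; label 7 is missing
starts = (np.arange(len(text)) % n == 0).astype(int)  # 1 opens a chunk

# Flags on a fresh 0..12 index: assignment aligns by label, so every flag
# after the NaN row lands one row too early and label 13 receives NaN.
misaligned = df.assign(New=pd.Series(starts))["New"]

# Flags on the original labels: the NaN row is filled with 0 explicitly.
aligned = pd.Series(starts, index=text.index).reindex(df.index, fill_value=0)

print(pd.DataFrame({"misaligned": misaligned, "aligned": aligned}))
```

Keeping the original labels on the flags, as in `aligned`, serves the same purpose as the dictionary lookup in the answer's EDIT.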
