Conditional merging of dataframe rows

Question

I have an N x 2 dataframe of chat messages (one column for the speaker, one for the message), and I am trying to find the cleanest way to merge consecutive messages that originate from the same speaker. Here is a sample of the data I am working with:

import pandas as pd

mydata = pd.DataFrame(data=[['A','random text'],
                            ['B','random text'],
                            ['A','random text'],
                            ['A','random text'],
                            ['A','random text'],
                            ['B','random text'],
                            ['A','random text'],
                            ['B','random text'],
                            ['B','random text'],
                            ['A','random text']], columns=['speaker','message'])

Hopefully you can see that the order of speakers is not in an ABAB format as I would like. Instead, there are some sequences of AAAB and ABBA. My current thinking is to rebuild the dataframe from scratch, checking the ID of each row with the ID of the next index position...

mergeCheck = True
while mergeCheck is True:
    # set length of the dataframe
    lenDF = len(mydata)
    # empty list to rebuild dataframe
    mergeDF = []
    # set index position at the beginning of dataframe
    i = 0
    while i < lenDF-1:
        # check whether adjacent rows have different ID
        if mydata['speaker'].iloc[i] != mydata['speaker'].iloc[i+1]:
            # if true, append row as is to mergeDF list
            mergeDF.append([mydata['speaker'].iloc[i],
                            mydata['message'].iloc[i]])
            # increase index position by 1
            i += 1
        else:
            # merge messages (note: joined without a separator)
            mergeDF.append([mydata['speaker'].iloc[i],
                            mydata['message'].iloc[i] + mydata['message'].iloc[i+1]])
            # increase index position by 2
            i += 2
    # if the index position falls on the last message, it was not merged
    if i == lenDF-1:
        # append row as is to mergeDF list
        mergeDF.append([mydata['speaker'].iloc[i],
                        mydata['message'].iloc[i]])
        # increase counter by 1
        i += 1
    # exit the outer loop once every row has been processed
    if i == lenDF:
        mergeCheck = False

However, this only merges pairs of adjacent messages. Returning to my original data, when the result is put into a dataframe, the above code generates the following output...

--------------------------
  speaker  |   message
--------------------------
    A         'random text'
    B         'random text'
    A         'random textrandom text'
    A         'random text'
    B         'random text'
    A         'random text'
    B         'random textrandom text'
    A         'random text'
--------------------------

I have thought to extend the function to check more comparisons of i (i.e. does '.iloc[i] != .iloc[i+2]', or '.iloc[i] != .iloc[i+3]' etc.), but this gets unworkable really quickly. What I think I need is some way to repeat the above function until the dataframe is in the desired format. But I'm unsure how to go about this.

Serge de Gosson de Varennes · Accepted Answer · 2020-12-23 14:20:16Z

A possible solution is this:

df1 = mydata[mydata['speaker']=='A'].reset_index()
df2= mydata[mydata['speaker']=='B'].reset_index()
df = pd.concat([df1, df2]).sort_index()

which returns

  index speaker      message
0      0       A  random text
0      1       B  random text
1      2       A  random text
1      5       B  random text
2      3       A  random text
2      7       B  random text
3      4       A  random text
3      8       B  random text
4      6       A  random text
5      9       A  random text

If you have a timestamp for these messages, remember to sort by time/date before resetting the index. Also, when concatenating, check that the chronological order is preserved.
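As a rough sketch of the sort-before-reset step: the question's data has no time information, so the `timestamp` column, the `chat` frame and its values below are all made up for illustration.

```python
import pandas as pd

# Hypothetical data: the 'timestamp' column is an assumption, not part of the question.
chat = pd.DataFrame({
    'speaker': ['A', 'B', 'A'],
    'message': ['third', 'second', 'first'],
    'timestamp': pd.to_datetime(['2020-12-22 10:02',
                                 '2020-12-22 10:01',
                                 '2020-12-22 10:00']),
})

# Sort chronologically first (mergesort is stable, so ties keep their original order),
# then renumber the rows so the index reflects the time order.
chat_sorted = chat.sort_values('timestamp', kind='mergesort').reset_index(drop=True)
print(chat_sorted)
```

Doing the sort before `reset_index` means any later comparison of adjacent rows sees the messages in the order they were actually sent.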

EDIT

After your clarification in the comments, I suggest this. First create a key that labels each run of consecutive rows from the same speaker (A or B), and then group by that key and the speaker.

df = mydata.copy()  # start again from the original data, not the concatenated frame above
df['key'] = (df['speaker'] != df['speaker'].shift(1)).astype(int).cumsum()

which gives

  speaker      message  key
0       A  random text    1
1       B  random text    2
2       A  random text    3
3       A  random text    3
4       A  random text    3
5       B  random text    4
6       A  random text    5
7       B  random text    6
8       B  random text    6
9       A  random text    7

Now you simply need to group by the key and the speaker, joining the messages in each group:

df = df.groupby(['key', 'speaker'])['message'].apply(' '.join)
df

which gives

key  speaker
1    A                                  random text
2    B                                  random text
3    A          random text random text random text
4    B                                  random text
5    A                                  random text
6    B                      random text random text
7    A                                  random text
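The grouped result above is a Series with a two-level index. If a plain two-column dataframe is wanted, one possible follow-up (a sketch, not part of the original answer; the name `merged` is mine) is to reset the index and drop the helper key:

```python
import pandas as pd

# Same sample data as in the question.
mydata = pd.DataFrame(
    [[s, 'random text'] for s in 'ABAAABABBA'],
    columns=['speaker', 'message'],
)

df = mydata.copy()
# The key increases by 1 every time the speaker differs from the previous row,
# so each run of consecutive rows from one speaker shares a key.
df['key'] = (df['speaker'] != df['speaker'].shift(1)).astype(int).cumsum()

merged = (
    df.groupby(['key', 'speaker'])['message']
      .apply(' '.join)       # join the messages within each run
      .reset_index()         # turn the (key, speaker) index back into columns
      .drop(columns='key')   # the helper key is no longer needed
)
print(merged)
```

Because the keys are increasing integers, the default sorted group order is the same as the original message order.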

edited Dec 23, 2020 at 14:20

answered Dec 22, 2020 at 14:27

Serge de Gosson de Varennes


3 Comments

cookie1986 Over a year ago

Thanks for the input - Unfortunately this doesn't merge the message columns together. I probably should have made that clearer in the original post!

Serge de Gosson de Varennes Over a year ago

Ah! Now it makes sense. I edited my answer.

cookie1986 Over a year ago

Thanks, that's great, and much shorter than my approach!

cookie1986 · Accepted Answer · 2020-12-23 13:14:19Z

After some exploring, I have come up with a better solution than the one in my original post. I will detail that here for anyone experiencing a similar issue. I will refrain from accepting my own answer for the time being in case someone comes up with a better option.

import numpy as np
import pandas as pd

# compare each row with the previous
mydata['prev_speaker'] = mydata['speaker'].shift(1).mask(pd.isnull, mydata['speaker'])

# boolean value to determine whether current speaker differs from previous
mydata['speaker_change'] = np.where(mydata['speaker'] != mydata['prev_speaker'], 'True','False')

# empty list to record changes in speaker
counterList = []    

# initialize a counter to loop through dataframe
counter =1

# loop through dataframe, increasing counter by 1 if the speaker changes
for row in mydata['speaker_change']:
    if row == 'False':
        counterList.append(counter)
    else:
        counter+=1
        counterList.append(counter)

# add counterList to dataframe
mydata['chunking'] = counterList

# group the original message based on the chunking variable
mydata['message'] = mydata.groupby(['chunking'])['message'].transform(lambda x: ' '.join(x))

# drop duplicate rows based on message content and chunking
mydata = mydata.drop_duplicates(subset=['message','chunking'])

# drop non-needed columns
mydata = mydata.drop(['prev_speaker','speaker_change','chunking'], axis=1)

Which now gives me the following:

|---------------------|-------------------------------------|
|       Speaker       |               Message               |
|---------------------|-------------------------------------|
|          A          |             random text             |
|---------------------|-------------------------------------|
|          B          |             random text             |
|---------------------|-------------------------------------|
|          A          | random text random text random text |
|---------------------|-------------------------------------|
|          B          |             random text             |
|---------------------|-------------------------------------|
|          A          |             random text             |
|---------------------|-------------------------------------|
|          B          |       random text random text       |
|---------------------|-------------------------------------|
|          A          |             random text             |
|---------------------|-------------------------------------|
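For comparison, the counter loop and the helper columns can probably be replaced by a shift/cumsum key, in the spirit of the other answer. This is only a sketch of that idea: the placeholder messages `msg0` to `msg9` and the name `merged` are my own, chosen so the merge order is visible.

```python
import pandas as pd

# Same speaker sequence as in the question, with distinguishable placeholder messages.
mydata = pd.DataFrame(
    [[s, f'msg{i}'] for i, s in enumerate('ABAAABABBA')],
    columns=['speaker', 'message'],
)

# True wherever the speaker differs from the previous row; the cumulative sum
# then numbers each run of consecutive rows from the same speaker.
chunking = (mydata['speaker'] != mydata['speaker'].shift()).cumsum().rename('chunking')

merged = (
    mydata.groupby(chunking, sort=False)
          .agg(speaker=('speaker', 'first'),    # every row in a chunk has the same speaker
               message=('message', ' '.join))   # join the chunk's messages with spaces
          .reset_index(drop=True)
)
print(merged)
```

This avoids the temporary `prev_speaker` and `speaker_change` columns, and it does not rely on `drop_duplicates` to collapse the chunks.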
