I have a two-column dataframe of chat messages (N rows, with columns 'speaker' and 'message'), and I am trying to find the cleanest way to merge consecutive messages that come from the same speaker. Here is a sample of the data I am working with:
import pandas as pd

mydata = pd.DataFrame(data=[['A', 'random text'],
                            ['B', 'random text'],
                            ['A', 'random text'],
                            ['A', 'random text'],
                            ['A', 'random text'],
                            ['B', 'random text'],
                            ['A', 'random text'],
                            ['B', 'random text'],
                            ['B', 'random text'],
                            ['A', 'random text']], columns=['speaker', 'message'])
Hopefully you can see that the speakers do not strictly alternate in the ABAB format I would like. Instead, there are runs such as AAAB and ABBA. My current thinking is to rebuild the dataframe from scratch, comparing the speaker of each row with the speaker of the next row...
mergeCheck = True
while mergeCheck is True:
    # set length of the dataframe
    lenDF = len(mydata)
    # empty list to rebuild dataframe
    mergeDF = []
    # set index position at the beginning of dataframe
    i = 0
    while i < lenDF-1:
        # check whether adjacent rows have different ID
        if mydata['speaker'].iloc[i] != mydata['speaker'].iloc[i+1]:
            # if true, append row as is to mergeDF list
            mergeDF.append([mydata['speaker'].iloc[i],
                            mydata['message'].iloc[i]])
            # increase index position by 1
            i += 1
        else:
            # merge messages
            mergeDF.append([mydata['speaker'].iloc[i],
                            mydata['message'].iloc[i] + mydata['message'].iloc[i+1]])
            # increase index position by 2
            i += 2
    # exit the loop if index position falls on the last message
    if i == lenDF-1:
        # if true, append row as is to mergeDF list
        mergeDF.append([mydata['speaker'].iloc[i],
                        mydata['message'].iloc[i]])
        # increase counter by 1
        i += 1
    if i == lenDF:
        mergeCheck = False
However, this only merges two adjacent messages at a time. Returning to my original data, when mergeDF is put into a dataframe, the above loop generates the following output...
--------------------------
speaker | message
--------------------------
A 'random text'
B 'random text'
A 'random textrandom text'
A 'random text'
B 'random text'
A 'random text'
B 'random textrandom text'
A 'random text'
--------------------------
I have thought about extending the loop to look further ahead (i.e. does '.iloc[i] != .iloc[i+2]', or '.iloc[i] != .iloc[i+3]' etc.), but this becomes unworkable really quickly. What I think I need is some way to repeat the above pass until no two adjacent rows share a speaker, but I'm unsure how to go about this.
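For reference, the result I am after can probably be expressed without repeated passes at all, by labelling each run of consecutive rows from the same speaker and grouping on that label. The following is only a minimal sketch under stated assumptions: the function name `merge_consecutive` and its `sep` parameter are hypothetical names introduced here, and messages are joined with no separator to match the output of the loop above.

```python
import pandas as pd

def merge_consecutive(df, sep=""):
    """Collapse each run of consecutive rows from the same speaker into one row.

    `merge_consecutive` and `sep` are illustrative names, not part of pandas.
    """
    # True wherever the speaker differs from the previous row (always True for row 0)
    starts_new_run = df["speaker"] != df["speaker"].shift()
    # The running total of those flags gives every row in a run the same label
    run_id = starts_new_run.cumsum().rename("run")
    return (
        df.groupby(run_id, sort=False)
          .agg(speaker=("speaker", "first"),
               message=("message", lambda s: sep.join(s)))
          .reset_index(drop=True)
    )

mydata = pd.DataFrame(
    [["A", "random text"], ["B", "random text"], ["A", "random text"],
     ["A", "random text"], ["A", "random text"], ["B", "random text"],
     ["A", "random text"], ["B", "random text"], ["B", "random text"],
     ["A", "random text"]],
    columns=["speaker", "message"],
)

merged = merge_consecutive(mydata)
print(merged["speaker"].tolist())  # ['A', 'B', 'A', 'B', 'A', 'B', 'A']
```

Passing `sep=" "` (or a newline) would keep the merged messages readable instead of running them together.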