I have a tweeter data set (6,000,000+ rows) and want to extract the conversation threads from it. Let's assume it looks like this:
data = pd.DataFrame({'Tweet_ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
'Reply_to_tweet_ID': [None, '1', '1', '3', None, '1', '4', '2', '4'],
'Max Speed': [380., 370., 24., 26., 584., 48., 8., 123., None]})
Tweet_ID Reply_to_tweet_ID Max Speed
0 1 None 380
1 2 1 370
2 3 1 24
3 4 3 26
4 5 None 584
5 6 1 48
6 7 4 8
7 8 2 123
8 9 4 None
Basically, I should match Reply_to_tweet_ID with Tweet_ID and concatenate the results of this match in one row. The results should look something like this:
Tweet_ID1 Reply_to_tweet_ID1 Max Speed1 Tweet_ID2 Reply_to_tweet_ID2 Max Speed2 Tweet_ID3 Reply_to_tweet_ID3 Max Speed3 Tweet_ID4 Reply_to_tweet_ID4 Max Speed4
1 None 380 2 1 370 3 1 24 6 1 48
2 1 370 8 2 123
3 1 24 4 3 26
4 3 26 7 4 8 9 4 None
A similar question has been asked here, but the answer is not true.
My code is:
df = data.set_index(['Reply_to_tweet_ID', data.groupby('Reply_to_tweet_ID')\
.cumcount().add(1)])[['Tweet_ID','Max Speed']]\
.unstack().reset_index()
df.columns = ["{}{}".format(a, b) for a, b in df.columns]
df = df[df.Reply_to_tweet_ID != 'None']
but the result looks like this:
Reply_to_tweet_ID Tweet_ID1 Tweet_ID2 Tweet_ID3 Max Speed1 Max Speed2 Max Speed3
0 1 2 3 6 370 24 48
1 2 8 NaN NaN 123 NaN NaN
2 3 4 NaN NaN 26 NaN NaN
3 4 7 9 NaN 8 None NaN
dfhas duplicates columns