1

I have a tweeter data set (6,000,000+ rows) and want to extract the conversation threads from it. Let's assume it looks like this:

data = pd.DataFrame({'Tweet_ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                   'Reply_to_tweet_ID': [None, '1', '1', '3', None, '1', '4', '2', '4'],
                   'Max Speed': [380., 370., 24., 26., 584., 48., 8., 123., None]})
    Tweet_ID    Reply_to_tweet_ID   Max Speed
0   1           None                380
1   2           1                   370
2   3           1                   24
3   4           3                   26
4   5           None                584
5   6           1                   48
6   7           4                   8
7   8           2                   123
8   9           4                   None

Basically, I should match Reply_to_tweet_ID with Tweet_ID and concatenate the results of this match in one row. The results should look something like this:

Tweet_ID1   Reply_to_tweet_ID1  Max Speed1  Tweet_ID2   Reply_to_tweet_ID2  Max Speed2  Tweet_ID3   Reply_to_tweet_ID3  Max Speed3  Tweet_ID4   Reply_to_tweet_ID4  Max Speed4
1           None                380         2           1                   370         3           1                   24          6           1                   48
2           1                   370         8           2                   123
3           1                   24          4           3                   26
4           3                   26          7           4                   8           9           4                   None

A similar question has been asked here, but the answer is not true.

My code is:

df = data.set_index(['Reply_to_tweet_ID', data.groupby('Reply_to_tweet_ID')\
.cumcount().add(1)])[['Tweet_ID','Max Speed']]\
.unstack().reset_index()

df.columns = ["{}{}".format(a, b) for a, b in df.columns]

df = df[df.Reply_to_tweet_ID != 'None']

but the result looks like this:

    Reply_to_tweet_ID   Tweet_ID1   Tweet_ID2   Tweet_ID3   Max Speed1  Max Speed2  Max Speed3
0                   1           2           3           6          370          24          48
1                   2           8           NaN         NaN        123          NaN                 NaN
2                   3           4           NaN         NaN        26           NaN                 NaN
3                   4           7           9           NaN        8           None                 NaN

1
  • expected df has duplicates columns Commented May 8, 2020 at 11:31

1 Answer 1

1

IIUC, you can do this with a 'self-join' using merge, then reshaping the dataframe, flatten multiindex column headers and merge back to original dataframe:

#Create merged dataframe, data_m, to join Reply_to_tweet_ID to Tweet_ID
data_m = data[['Tweet_ID']].merge(data[['Reply_to_tweet_ID','Max Speed','Tweet_ID']], 
                                               left_on='Tweet_ID', 
                                               right_on='Reply_to_tweet_ID',
                                              suffixes=('','_y'))

#Use `set_index` with `groupby` and `cumcount` then `unstack` to 
#reshape long to wide for dataframe, data_u
data_u = data_m.set_index(['Tweet_ID', data_m.groupby('Tweet_ID').cumcount()+1]).unstack()
data_u = data_u.sort_index(axis=1, level=1)

#Flatten multiindex column header using list comprenhension    
data_u.columns = [f'{i}{j}' for i, j in data_u.columns]

#merge dataframe, data_u to the orginal dataframe, data
print(data.merge(data_u, on='Tweet_ID'))

Output:

  Tweet_ID Reply_to_tweet_ID  Max Speed  Max Speed1 Reply_to_tweet_ID1 Tweet_ID_y1  Max Speed2 Reply_to_tweet_ID2 Tweet_ID_y2  Max Speed3 Reply_to_tweet_ID3 Tweet_ID_y3
0        1              None      380.0       370.0                  1           2        24.0                  1           3        48.0                  1           6
1        2                 1      370.0       123.0                  2           8         NaN                NaN         NaN         NaN                NaN         NaN
2        3                 1       24.0        26.0                  3           4         NaN                NaN         NaN         NaN                NaN         NaN
3        4                 3       26.0         8.0                  4           7         NaN                  4           9         NaN                NaN         NaN
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.