Concatenate rows based on name with pandas dataframe

Question

I am trying to concatenate rows of data from this source

https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv

There are lines from the same speaker but they're broken into different rows in the dataframe. I'm trying to concatenate those speaker blocks into one row instead of 2+ rows.

Here's what I've tried but it doesn't work. I'm still learning pandas and python.

import pandas as pd

url = 'https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv'
data = pd.read_csv(url, on_bad_lines='skip')
# The first data row was read as the header, so the column names are values from that row.
data.drop(['I', 'I.1'], inplace=True, axis=1)
data.rename(columns={'In delivering my son from me, I bury a second husband.': 'Text', 'COUNTESS': 'Speaker'}, inplace=True)
# Groups on the speaker name alone, so all of a speaker's lines end up in one string.
data = data.groupby(['Speaker'])['Text'].apply(' '.join).reset_index()

@mkrieger1 It does concatenate, but it concatenates every line of that speaker into one long string. I want to keep each speech block separate and in order.
– EthanMcQ, Commented Sep 20, 2022 at 18:56
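
To make the problem concrete, here is a small sketch on a made-up four-row frame (the rows and the short Text values are invented for illustration, not taken from the CSV). It suggests why grouping on the name alone loses the speech blocks: every row for a speaker lands in the same group, wherever it appears.

```python
import pandas as pd

# Hypothetical miniature of the play data: COUNTESS speaks, BERTRAM replies,
# then COUNTESS speaks again.
data = pd.DataFrame({
    "Speaker": ["COUNTESS", "COUNTESS", "BERTRAM", "COUNTESS"],
    "Text": ["a", "b", "c", "d"],
})

# Grouping on the name alone collects every COUNTESS row into one group,
# including the one that comes after BERTRAM, and sorts groups by name.
merged = data.groupby("Speaker")["Text"].apply(" ".join).reset_index()
print(merged)
```

The desired result here would be three rows (COUNTESS "a b", BERTRAM "c", COUNTESS "d"), which needs a grouping key that changes whenever the speaker changes.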

JNevill · Accepted Answer · 2022-09-20 19:22:19Z

A little ugly, but you can group by runs of consecutive identical values using a helper series built with shift() and cumsum(), then aggregate within the groupby:

import pandas as pd
url = 'https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv'
df = pd.read_csv(url, on_bad_lines='skip', names=['act', 'scene', 'char', 'line'])
# The counter goes up by one each time the speaker differs from the previous row.
g = df['char'].ne(df['char'].shift()).cumsum().rename('speakernumber')
df = df.groupby(g).agg({'act': 'first', 'scene': 'first', 'char': 'first', 'line': ' '.join}).reset_index()

I believe this will work across acts and scenes as well, although I didn't dig in deep enough to test.
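
For anyone who wants to see the helper series step by step, here is a minimal sketch on a made-up five-row frame (the values are invented, and I have not checked whether the situation in the last row occurs in the real CSV). It also hints at one caveat for the act/scene question: if the same character speaks the last line of one scene and the first line of the next, the helper series alone treats that as one block. Adding act and scene as extra group keys is one possible way to keep them apart.

```python
import pandas as pd

# Hypothetical rows; the last COUNTESS line opens a new scene.
df = pd.DataFrame({
    "act": [1, 1, 1, 1, 1],
    "scene": [1, 1, 1, 1, 2],
    "char": ["COUNTESS", "COUNTESS", "BERTRAM", "COUNTESS", "COUNTESS"],
    "line": ["a", "b", "c", "d", "e"],
})

# True wherever the speaker differs from the previous row (first row is True).
changed = df["char"].ne(df["char"].shift())
# The running total of those flags gives one id per run of the same speaker.
g = changed.cumsum().rename("speakernumber")

# Grouping on the run id alone: "d" and "e" merge across the scene boundary.
by_run = df.groupby(g).agg(
    {"act": "first", "scene": "first", "char": "first", "line": " ".join}
).reset_index()

# Grouping on act, scene and the run id keeps the two scenes separate.
by_scene = df.groupby(["act", "scene", g], sort=False).agg(
    {"char": "first", "line": " ".join}
).reset_index()

print(by_run)
print(by_scene)
```

Passing sort=False keeps the groups in order of first appearance, which matters if act or scene labels do not sort in reading order.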

2 Comments

EthanMcQ Over a year ago

This worked perfectly, I couldn't figure out the groupby without a good index, but this worked. Thanks!

JNevill Over a year ago

I think the trick was to build that index on which to aggregate (g in this case). It's a similar approach in SQL, where you would create a helper column, for example by flagging each change of char with LAG() and taking a running SUM() of those flags as speakernumber, and then GROUP BY act, scene, speakernumber and aggregate using string_agg()/listagg(). (If thinking through this in SQL is helpful.)
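
If the SQL view helps, here is a rough sketch of building such a helper column with window functions, run through Python's built-in sqlite3 module so it stays self-contained. The table name play, the column names and the rows are all made up for illustration, and it assumes the bundled SQLite is 3.25 or newer (the first version with window functions). It only computes speakernumber; the string aggregation step is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; id records the original row order.
conn.execute("CREATE TABLE play (id INTEGER PRIMARY KEY, speaker TEXT, line TEXT)")
conn.executemany(
    "INSERT INTO play (speaker, line) VALUES (?, ?)",
    [("COUNTESS", "a"), ("COUNTESS", "b"), ("BERTRAM", "c"), ("COUNTESS", "d")],
)

query = """
WITH flagged AS (
    -- previous row's speaker, NULL for the first row
    SELECT id, speaker, LAG(speaker) OVER (ORDER BY id) AS prev
    FROM play
)
SELECT speaker,
       -- running count of speaker changes, the analogue of ne/shift/cumsum
       SUM(CASE WHEN speaker IS NOT prev THEN 1 ELSE 0 END)
           OVER (ORDER BY id) AS speakernumber
FROM flagged
ORDER BY id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

LAG() plays the role of shift(), and the running SUM() plays the role of cumsum().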
