I am trying to concatenate rows of data from this source

https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv

A single speech by one speaker is often split across several rows of the dataframe. I'm trying to concatenate each of those speech blocks into one row instead of two or more.

Here's what I've tried but it doesn't work. I'm still learning pandas and python.

import pandas as pd

url = 'https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv'
data = pd.read_csv(url, on_bad_lines='skip')
# The file has no header row, so the first data row was read as the column names
data.drop(columns=['I', 'I.1'], inplace=True)
data.rename(columns={'In delivering my son from me, I bury a second husband.': 'Text', 'COUNTESS': 'Speaker'}, inplace=True)
# Groups by speaker name only, so all of a speaker's lines end up in one string
data = data.groupby(['Speaker'])['Text'].apply(' '.join).reset_index()
  • Why do you think that it doesn't work? Commented Sep 20, 2022 at 18:53
  • @mkrieger1 It does concatenate, but it joins every line by that speaker into one long string. I want to keep each speech block separate and in its original order. Commented Sep 20, 2022 at 18:56

1 Answer

A little ugly, but you can group by consecutive values using a helper series built with shift() and cumsum(), then aggregate within the groupby:

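import pandas as pd

# The CSV has no header row, so supply the column names explicitly
df = pd.read_csv('https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv', on_bad_lines='skip', names=['act', 'scene', 'char', 'line'])
# Helper series: increases by 1 each time the speaker differs from the previous row
g = df['char'].ne(df['char'].shift()).cumsum().rename('speakernumber')
df = df.groupby(g).agg({'act': 'first', 'scene': 'first', 'char': 'first', 'line': ' '.join}).reset_index()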
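To see what the helper series contains, here is a minimal sketch on a tiny made-up frame. The `toy` data and the intermediate name `starts_block` are invented for illustration and are not taken from the CSV; only the `shift()`/`ne()`/`cumsum()` chain mirrors the code above.

```python
import pandas as pd

# Toy data, invented for illustration (not from the Shakespeare CSV).
toy = pd.DataFrame({
    "char": ["A", "A", "B", "A", "A", "A"],
    "line": ["a1", "a2", "b1", "a3", "a4", "a5"],
})

# True wherever the speaker differs from the previous row. The first row is
# compared against NaN, so it always counts as the start of a block.
starts_block = toy["char"].ne(toy["char"].shift())

# Running count of block starts: one integer label per consecutive run.
g = starts_block.cumsum().rename("speakernumber")
print(g.tolist())  # [1, 1, 2, 3, 3, 3]

# Aggregate each run: keep the speaker, join the lines in order.
blocks = toy.groupby(g).agg({"char": "first", "line": " ".join}).reset_index()
print(blocks)
```

Speaker A appears twice in the result because the two runs are separated by B's line, which is the difference from grouping on the speaker name alone.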

I believe this approach works across acts and scenes as well, although I didn't dig deep enough to test it.
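One case that may be worth checking is a speaker who closes one scene and opens the next: a helper built from the speaker column alone treats those rows as a single run. The sketch below uses invented toy data to show that behaviour, plus a hypothetical variant (`keys`, `g_keys`) that also starts a new block whenever the act or scene changes. Whether the real CSV contains such a boundary, and whether merging across it is wanted, are assumptions I have not verified.

```python
import pandas as pd

# Toy data, invented for illustration: speaker A ends scene 1 and opens scene 2.
toy = pd.DataFrame({
    "act":   [1, 1, 1, 1],
    "scene": [1, 1, 2, 2],
    "char":  ["B", "A", "A", "B"],
    "line":  ["b1", "a1", "a2", "b2"],
})
agg_spec = {"act": "first", "scene": "first", "char": "first", "line": " ".join}

# Speaker-only helper: A's two rows form one run despite the scene change.
g_char = toy["char"].ne(toy["char"].shift()).cumsum().rename("speakernumber")
by_speaker = toy.groupby(g_char).agg(agg_spec).reset_index()
print(by_speaker["line"].tolist())  # ['b1', 'a1 a2', 'b2']

# Variant: a row starts a new block if ANY of act, scene, or char changed.
keys = ["act", "scene", "char"]
changed = toy[keys].ne(toy[keys].shift()).any(axis=1)
g_keys = changed.cumsum().rename("speakernumber")
by_keys = toy.groupby(g_keys).agg(agg_spec).reset_index()
print(by_keys["line"].tolist())  # ['b1', 'a1', 'a2', 'b2']
```

If speeches should never span a scene break, the second helper is the safer choice; if they should, the speaker-only helper already does that.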

2 Comments

This worked perfectly. I couldn't figure out the groupby without a good index, but this did it. Thanks!
I think the trick was to build an index on which to aggregate (g in this case). The approach in SQL is similar, if thinking it through that way is helpful: create a helper column speakernumber with window functions (for example, flag the rows where char differs from LAG(char), then take a running SUM() of that flag), then GROUP BY act, scene, speakernumber and aggregate using string_agg() or listagg().
