Return multiple rows per row for pandas DataFrame

Question

Sample Input DataFrame:

merged_df
                 Full Name   Kommata 2007     Kommata 2015                 Kommata 2019
0        Athanasios bouras   New democracy    New democracy                New democracy
1        Andreas loverdos    Pasok            Pasok-democratic alignment   Movement for change
2        Theodora tzakri     Pasok            Pasok                        Syriza
3        Thanasis zempilis   Pasok            NaN                          New democracy

Desired DataFrame:

edges_df

         Source                             Target         
0        New democracy_2007                 New democracy_2015
1        New democracy_2015                 New democracy_2019
2        Pasok_2007                         Pasok-democratic alignment_2015
3        Pasok-democratic alignment_2015    Movement for change_2019
4        Pasok_2007                         Pasok_2015
5        Pasok_2015                         Syriza_2019
6        Pasok_2007                         New democracy_2019

As implied above, I have an input DataFrame with n columns; the first one has unique values (Full Name) and the other n-1 (Kommata YYYY) are some attributes of the rows. I want to generate a new DataFrame with two columns as follows:

For each Full Name, it will have 0 or more rows.
Starting from the leftmost Kommata column, it takes every adjacent pair of non-null values, e.g. Kommata 2007-Kommata 2015 and Kommata 2015-Kommata 2019; the pair Kommata 2007-Kommata 2019 can exist only if Kommata 2015 is null.
Every pair becomes a new row.
Each column's value is modified to value_YYYY, where the value stays the same and YYYY is taken from the column name (e.g. '{}_{}'.format(prev_value, col_name.split()[-1])).
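
To make the pairing rule concrete, here is a minimal plain-Python sketch for a single row. It is only an illustration of the rules listed above, not code from the question; the function name adjacent_pairs and the dict-based row representation are assumptions made for this example.

```python
import math

def adjacent_pairs(row):
    """row: dict mapping column name -> value, in left-to-right column order.

    Hypothetical helper: labels every non-null value as value_YYYY and
    pairs each label with the next one.
    """
    labels = [
        '{}_{}'.format(value, col_name.split()[-1])
        for col_name, value in row.items()
        # Skip None and NaN, so the neighbours on either side become adjacent.
        if value is not None
        and not (isinstance(value, float) and math.isnan(value))
    ]
    return list(zip(labels, labels[1:]))

pairs = adjacent_pairs({
    'Kommata 2007': 'Pasok',
    'Kommata 2015': float('nan'),
    'Kommata 2019': 'New democracy',
})
print(pairs)  # [('Pasok_2007', 'New democracy_2019')]
```

Because the null 2015 value is dropped before pairing, 2007 and 2019 become adjacent, which is the only case in which that pair should appear.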

Thanks in advance

Juan Estevez · Accepted Answer · 2019-12-24 16:43:07Z

You can use pd.melt to do this (df below is the input DataFrame, i.e. merged_df in the question):

import pandas as pd

# A list of columns to melt.
value_cols = list(df.columns)[1:]

# Melt said columns while leaving the others (in this case only 'Full Name') intact.
df = pd.melt(df, id_vars=['Full Name'], value_vars=value_cols)

# Get the year from 'variable'
df['variable'] = df['variable'].str.split(' ').str[-1]

# Sort the values by 'Full Name' and then year (required).
df = df.sort_values(by=['Full Name', 'variable'])

# Drop rows with empty values.
df = df.dropna()

# Build the value_YYYY labels.
df['Source'] = df['value'] + '_' + df['variable']

# Pair the values (This is why the previous sort is required).
df['Target'] = df['Source'].shift(-1)

# Remove rows where the values don't belong to the same name.
mask = df['Full Name'].eq(df['Full Name'].shift(-1).bfill())
df = df.loc[mask]

# Keep only relevant columns.
df = df.reindex(columns=['Source', 'Target'])
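
As a quick check, the steps above can be wrapped in a function and run on the sample data from the question. This is a sketch under a few assumptions: the input frame is named merged_df as in the question, build_edges is a hypothetical helper name, and the final index reset is added only for tidier output.

```python
import numpy as np
import pandas as pd

# Sample input from the question.
merged_df = pd.DataFrame({
    'Full Name': ['Athanasios bouras', 'Andreas loverdos',
                  'Theodora tzakri', 'Thanasis zempilis'],
    'Kommata 2007': ['New democracy', 'Pasok', 'Pasok', 'Pasok'],
    'Kommata 2015': ['New democracy', 'Pasok-democratic alignment',
                     'Pasok', np.nan],
    'Kommata 2019': ['New democracy', 'Movement for change',
                     'Syriza', 'New democracy'],
})

def build_edges(df):
    """Hypothetical wrapper around the melt-based steps."""
    value_cols = list(df.columns)[1:]
    long_df = pd.melt(df, id_vars=['Full Name'], value_vars=value_cols)
    # Keep only the year part of the column name.
    long_df['variable'] = long_df['variable'].str.split(' ').str[-1]
    # Sort by name, then year, and drop the null values.
    long_df = long_df.sort_values(by=['Full Name', 'variable']).dropna()
    long_df['Source'] = long_df['value'] + '_' + long_df['variable']
    long_df['Target'] = long_df['Source'].shift(-1)
    # Keep a row only if the next row belongs to the same name.
    mask = long_df['Full Name'].eq(long_df['Full Name'].shift(-1))
    return long_df.loc[mask, ['Source', 'Target']].reset_index(drop=True)

edges_df = build_edges(merged_df)
print(edges_df)
```

On the sample data this yields the seven edges listed in the question, ordered alphabetically by 'Full Name' rather than in the original row order.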

I'm assuming the order of the output doesn't matter. The output of this code will be sorted alphabetically by 'Full Name'.
If you need to maintain the order you would need to modify the df.sort_values line to sort according to the original order of 'Full Name' instead of alphabetically.
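
One possible way to keep the original row order, sketched under the assumption that the input is the question's merged_df: record each row's position in a helper column before melting and sort on that instead of on 'Full Name'. The column name _order is made up for this example.

```python
import numpy as np
import pandas as pd

# Sample input from the question.
merged_df = pd.DataFrame({
    'Full Name': ['Athanasios bouras', 'Andreas loverdos',
                  'Theodora tzakri', 'Thanasis zempilis'],
    'Kommata 2007': ['New democracy', 'Pasok', 'Pasok', 'Pasok'],
    'Kommata 2015': ['New democracy', 'Pasok-democratic alignment',
                     'Pasok', np.nan],
    'Kommata 2019': ['New democracy', 'Movement for change',
                     'Syriza', 'New democracy'],
})

# '_order' is a hypothetical helper column holding the original row position.
ordered = merged_df.assign(_order=range(len(merged_df)))
long_df = pd.melt(ordered, id_vars=['_order', 'Full Name'])
long_df['variable'] = long_df['variable'].str.split(' ').str[-1]

# Sort by original position, then year, instead of alphabetically by name.
long_df = long_df.sort_values(by=['_order', 'variable']).dropna(subset=['value'])

long_df['Source'] = long_df['value'] + '_' + long_df['variable']
long_df['Target'] = long_df['Source'].shift(-1)

# Keep a row only if the next row comes from the same original row.
mask = long_df['_order'].eq(long_df['_order'].shift(-1))
edges_df = long_df.loc[mask, ['Source', 'Target']].reset_index(drop=True)
print(edges_df)
```

Comparing on _order rather than on 'Full Name' also keeps rows apart if two people happened to share a name, although the question states that the names are unique.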

edited Dec 24, 2019 at 16:43

answered Dec 24, 2019 at 15:00


2 Comments

drkostas Over a year ago

That's almost perfect! It works but it doesn't go only from left to right but from right to left too. It produces rows that have Source: xxxx_2019 and Target: xxxx_2015. How easy is it to be fixed?

Juan Estevez Over a year ago

@drkostas The code was relying on the columns being sorted by ascending year. I have modified the df.sort_values line to also sort by year to avoid this. Please try it again.
