
Sample Input DataFrame:

merged_df
                 Full Name   Kommata 2007     Kommata 2015                 Kommata 2019
0        Athanasios bouras   New democracy    New democracy                New democracy
1        Andreas loverdos    Pasok            Pasok-democratic alignment   Movement for change
2        Theodora tzakri     Pasok            Pasok                        Syriza
3        Thanasis zempilis   Pasok            NaN                          New democracy

Desired DataFrame:

edges_df

         Source                             Target         
0        New democracy_2007                 New democracy_2015
1        New democracy_2015                 New democracy_2019
2        Pasok_2007                         Pasok-democratic alignment_2015
3        Pasok-democratic alignment_2015    Movement for change_2019
4        Pasok_2007                         Pasok_2015
5        Pasok_2015                         Syriza_2019
6        Pasok_2007                         New democracy_2019

As implied above, I have an input DataFrame with n columns; the first one has unique values (Full Name) and the other n-1 (Kommata YYYY) are some attributes of the rows. I want to generate a new DataFrame with two columns as follows:

  • For each Full Name it will have 0 or more rows

  • Starting from the leftmost Kommata column, it takes every adjacent pair of non-null values, e.g. Kommata 2007-Kommata 2015 and Kommata 2015-Kommata 2019; the pair Kommata 2007-Kommata 2019 can exist only if Kommata 2015 is null

  • Every pair will be a new row

  • Each column's value is modified like this: value_YYYY where the value remains the same and the YYYY is taken from the column name (e.g. '{}_{}'.format(prev_value, col_name.split()[-1]))
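
For illustration, the rules above can be sketched as a plain row-wise loop. This is only a reference for the expected output, not necessarily an efficient approach. It assumes the Kommata columns are already ordered from oldest to newest; `year_cols`, `tagged` and `rows` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Sample input from above.
merged_df = pd.DataFrame({
    "Full Name": ["Athanasios bouras", "Andreas loverdos",
                  "Theodora tzakri", "Thanasis zempilis"],
    "Kommata 2007": ["New democracy", "Pasok", "Pasok", "Pasok"],
    "Kommata 2015": ["New democracy", "Pasok-democratic alignment", "Pasok", np.nan],
    "Kommata 2019": ["New democracy", "Movement for change", "Syriza", "New democracy"],
})

# Assumption: every column except 'Full Name' is a 'Kommata YYYY' column,
# listed left to right in ascending year order.
year_cols = [c for c in merged_df.columns if c != "Full Name"]

rows = []
for _, row in merged_df.iterrows():
    # Keep only the non-null values, each tagged with the year from its column name.
    tagged = ["{}_{}".format(row[c], c.split()[-1])
              for c in year_cols if pd.notna(row[c])]
    # Adjacent pairs of the remaining values become (Source, Target) rows.
    rows.extend(zip(tagged, tagged[1:]))

edges_df = pd.DataFrame(rows, columns=["Source", "Target"])
```

Because null values are dropped before pairing, a row with a missing Kommata 2015 yields the single pair Kommata 2007-Kommata 2019.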

Thanks in advance

1 Answer

You can use pd.melt to do this:

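import pandas as pd

# df is the input DataFrame (merged_df in the question).
# A list of columns to melt: every column except the first ('Full Name').
value_cols = list(df.columns)[1:]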

# Melt said columns while leaving the others (in this case only 'Full Name') intact.
df = pd.melt(df, id_vars=['Full Name'], value_vars=value_cols)

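# Get the year from 'variable' (the last token of e.g. 'Kommata 2007').
df['variable'] = df['variable'].str.split().str[-1]

# Sort by 'Full Name' and then by year, so that each person's rows are
# contiguous and in chronological order (required for the shift below).
df = df.sort_values(by=['Full Name', 'variable'])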

# Drop rows with empty values.
df = df.dropna()

# Build the node label, e.g. 'Pasok_2007'.
df['Source'] = df['value'] + '_' + df['variable']

# Pair each value with the next one (this is why the previous sort is required).
df['Target'] = df['Source'].shift(-1)

# Keep only rows whose next row belongs to the same name. This also drops
# each person's last row, which has no successor.
mask = df['Full Name'].eq(df['Full Name'].shift(-1))
df = df.loc[mask]

# Keep only the relevant columns and renumber the rows from 0.
df = df[['Source', 'Target']].reset_index(drop=True)

I'm assuming the order of the output doesn't matter: the result of this code is sorted alphabetically by 'Full Name'.
If you need to keep the original row order, change the df.sort_values line to sort by each name's original position instead of alphabetically.
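
One possible way to keep the original order, sketched under a few assumptions: 'Full Name' values are unique (as stated in the question), the year is always the last token of the column name, and the helper name `build_edges` and the intermediate columns `column`, `party`, `year` and `pos` are hypothetical names introduced here. It sorts on each name's position in the input instead of on the name itself, and uses a grouped shift so that no cross-name mask is needed.

```python
import numpy as np
import pandas as pd

def build_edges(wide, id_col="Full Name"):
    """Hypothetical helper: melt-based edge list that keeps the input row order."""
    long = wide.melt(id_vars=[id_col], var_name="column", value_name="party")
    # Year is the last token of the column name, e.g. 'Kommata 2007' -> 2007.
    long["year"] = long["column"].str.split().str[-1].astype(int)
    # Position of each name in the original frame (assumes unique names).
    order = {name: i for i, name in enumerate(wide[id_col])}
    long["pos"] = long[id_col].map(order)
    # Drop missing parties, then sort by original position and year.
    long = long.dropna(subset=["party"]).sort_values(["pos", "year"])
    long["Source"] = long["party"] + "_" + long["year"].astype(str)
    # Shift within each name, so the last row of a name gets NaN as Target.
    long["Target"] = long.groupby("pos")["Source"].shift(-1)
    return long.dropna(subset=["Target"])[["Source", "Target"]].reset_index(drop=True)

# Sample input from the question.
merged_df = pd.DataFrame({
    "Full Name": ["Athanasios bouras", "Andreas loverdos",
                  "Theodora tzakri", "Thanasis zempilis"],
    "Kommata 2007": ["New democracy", "Pasok", "Pasok", "Pasok"],
    "Kommata 2015": ["New democracy", "Pasok-democratic alignment", "Pasok", np.nan],
    "Kommata 2019": ["New democracy", "Movement for change", "Syriza", "New democracy"],
})

edges_df = build_edges(merged_df)
```

On the sample data this should give the seven rows of the desired edges_df, in the same order as in the question. Sorting on an integer year also means the result does not depend on the left-to-right order of the Kommata columns.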


2 Comments

That's almost perfect! It works, but it pairs values not only from left to right but also from right to left: it produces rows with Source: xxxx_2019 and Target: xxxx_2015. How easy is that to fix?
@drkostas The code was relying on the columns being sorted by ascending year. I have modified the df.sort_values line to also sort by year to avoid this. Please try it again.
