1

Python newbie attempting a complex pandas dataframe logic

I have multiple dataframes I need to join but I'll show two below for the example. The dataframe have duplicate columns labelled with suffix '_duplicate'. I need to replicate the row instead of having the duplicate column as seen below.

My first thought is to get a list of unique column names then create an empty dataframe with those columns. Then have a for loop checking if column exists if so append, if the column_duplicate also append etc but unsure how to create this dataframe.

List_of_columns = ["a", "b", "c", "d", "id"]

Dataframe1:X

a b a_duplicate b_duplicate c id
1 2 3 4 5 id1

Dataframe2:Y

a c a_duplicate c_duplicate d id
6 7 8 9 10 id2

Created dataframe:

a b c d id
1 2 5 Null id1
3 4 5 Null id1
6 Null 7 10 id2
8 Null 7 10 id2
2
  • Is this a situation of "treating the symptom not the disease"? Maybe it would be easier to correctly make the dataframes first before trying to fix these ones? Commented Jul 26, 2022 at 16:47
  • Makes sense, I can reformat the individual dataframes first by appending a replicated row at the bottom before joining. How would I go about this replication of rows for the duplicated column? Any starting point? Commented Jul 26, 2022 at 16:57

2 Answers 2

2

This is a very silly way of doing it and I am hoping someone comes up with a better way... but it does work:

##################### Recreate OP's dataframe ###########################
data1 = {"a":1, "b":2, "a_duplicate":3,"b_duplicate":4,"c":5, "id":"id1"}
data2 = {"a":6, "c":7, "a_duplicate":8,"c_duplicate":9,"d":10, "id":"id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
#########################################################################

# Append columns together while renaming the duplicate columns
df1 = df1[["a", "b", "c", "id"]].append(df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(columns={"a_duplicate": "a", "b_duplicate": "b"}))
df2 = df2[["a", "c", "d", "id"]].append(df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(columns={"a_duplicate": "a", "c_duplicate": "c"}))

# Concatenate the resulting datafraames, reset the index, then put it in the correct column order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3

Output:

    a   b   c   d       id
0   1   2.0 5   NaN     id1
1   3   4.0 5   NaN     id1
2   6   NaN 7   10.0    id2
3   8   NaN 9   10.0    id2

~~ For OP's Comment ~~

This is pretty hacky but should be able to go through all of your 50 dataframes and correct them then combine them into a master dataframe. You will have to come up with your own way of looping through all of them (this codes places all of them in a dataframeList then cycles through those dataframes). I don't know how long it will take as I don't know how big your data is but... it' worth a shot.

data1 = {"a":1, "b":2, "a_duplicate":3,"b_duplicate":4,"c":5, "id":"id1"}
data2 = {"a":6, "c":7, "a_duplicate":8,"c_duplicate":9,"d":10, "id":"id2"}
data3 = {"a":3, "b":2, "c":7, "a_duplicate":15,"b_duplicate":20, "c_duplicate":9,"d":10, "id":"id3"}
data4 = {"a":4, "d":3, "c":5, "a_duplicate":7,"d_duplicate":15, "c_duplicate":9,"d":10, "id":"id4"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
df3 = pd.DataFrame(data3, index=[0])
df4 = pd.DataFrame(data4, index=[0])
dataframeList = [df1, df2, df3, df4]

finalDF = pd.DataFrame(columns=["a", "b", "c", "d", "id"])

for df in dataframeList:
    notDup = [x for x in df.columns if "_duplicate" not in x]                   # Find column names that are not duplicated
    isDup  = list(set(df.columns)-set(notDup))                                  # Find duplicate column names
    dupColumns = isDup + list(set(notDup) - {x.split("_")[0] for x in isDup})   # Create list of column names for duplicated dataframe
    dupDF = df[dupColumns]                                                      # set the duplicate dataframe to be these columns

    for dup in isDup:                                                           # Cycle through every duplicated column name and rename it
        letter = dup.split("_")[0]                                              # to just the column name without "_duplicate"
        dupDF = dupDF.rename(columns={dup:letter})

    df = df[notDup].append(dupDF)                                               # Append the not duplicated columns with the duplicated columns

    finalDF = pd.concat([finalDF, df], ignore_index = True)                     # Concatenate all of them into one master dataframe

Output:

a   b   c   d   id
0   1   2   5   NaN id1
1   3   4   5   NaN id1
2   6   NaN 7   10  id2
3   8   NaN 9   10  id2
4   3   2   7   10  id3
5   15  20  9   10  id3
6   4   NaN 5   10  id4
7   7   NaN 9   15  id4
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the soln, works well! But i'm looking to join 50+ files within a directory is there any other way to do so without appending the previous df to the new df 50 times or so?
@MK2121 check the edit. I'm working off of limited data but if what you've provided me is the case, this should work (or it should get you started at least)
1

You can try

def explode(df):
    duplicate_cols = (df.columns.str.extract('(.*)_duplicate')
                      .dropna()[0].tolist())
    unduplicate_cols = (df.columns.difference(duplicate_cols)
                        .to_series()
                        [lambda s: ~s.str.contains('_duplicate')].tolist())
    out = df.T.groupby(df.columns.str.split('_').str[0]).agg(list).T
    out = (out.explode(duplicate_cols, ignore_index=True)
           .explode(unduplicate_cols, ignore_index=True))
    return out

out = pd.concat([explode(df1), explode(df2)], ignore_index=True)
print(out)

   a    b  c   id    d
0  1    2  5  id1  NaN
1  3    4  5  id1  NaN
2  6  NaN  7  id2   10
3  8  NaN  9  id2   10

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.