Create pandas dataframe on column name conditions

Question

I'm new to Python and attempting some fairly complex pandas dataframe logic.

I have multiple dataframes I need to join, but I'll show two below as an example. The dataframes have duplicate columns labelled with the suffix '_duplicate'. Instead of keeping each duplicate column, I need to replicate the row, as shown below.

My first thought is to get a list of unique column names and create an empty dataframe with those columns, then loop over each dataframe: if a column exists, append its value, and if column_duplicate also exists, append that too. However, I'm unsure how to create this dataframe.

List_of_columns = ["a", "b", "c", "d", "id"]
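As a rough sketch of the first step described above, the list of unique column names could be derived from the dataframes themselves rather than typed by hand. The helper name `base_name` and the constant `SUFFIX` are made up for this illustration, and the sketch assumes that '_duplicate' only ever appears at the end of a column name.

```python
import pandas as pd

SUFFIX = "_duplicate"  # assumption: the only marker of a duplicated column


def base_name(col: str) -> str:
    """Strip the suffix so that 'a_duplicate' maps back to 'a'."""
    return col[: -len(SUFFIX)] if col.endswith(SUFFIX) else col


# The two example dataframes from the question
x = pd.DataFrame({"a": [1], "b": [2], "a_duplicate": [3], "b_duplicate": [4], "c": [5], "id": ["id1"]})
y = pd.DataFrame({"a": [6], "c": [7], "a_duplicate": [8], "c_duplicate": [9], "d": [10], "id": ["id2"]})

# Collect every column name with the suffix removed, then sort for a stable order
list_of_columns = sorted({base_name(c) for df in (x, y) for c in df.columns})
print(list_of_columns)
```

For the two example dataframes this yields the same five names as the hand-written list.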

Dataframe1:X

a	b	a_duplicate	b_duplicate	c	id
1	2	3	4	5	id1

Dataframe2:Y

a	c	a_duplicate	c_duplicate	d	id
6	7	8	9	10	id2

Created dataframe:

a	b	c	d	id
1	2	5	Null	id1
3	4	5	Null	id1
6	Null	7	10	id2
8	Null	7	10	id2

Is this a situation of "treating the symptom, not the disease"? Maybe it would be easier to build the dataframes correctly in the first place rather than trying to fix these ones?
– Michael S., Commented Jul 26, 2022 at 16:47
Makes sense. I can reformat the individual dataframes first by appending a replicated row at the bottom before joining. How would I go about replicating rows for the duplicated columns? Any starting point?
– MK2121, Commented Jul 26, 2022 at 16:57

Michael S. · Accepted Answer · 2022-07-26 18:08:21Z

This is a very silly way of doing it and I am hoping someone comes up with a better way... but it does work:

import pandas as pd

##################### Recreate OP's dataframe ###########################
data1 = {"a":1, "b":2, "a_duplicate":3,"b_duplicate":4,"c":5, "id":"id1"}
data2 = {"a":6, "c":7, "a_duplicate":8,"c_duplicate":9,"d":10, "id":"id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
#########################################################################

# Stack the original columns on top of the duplicate columns, renaming the duplicates to match.
# pd.concat is used here because DataFrame.append was removed in pandas 2.0.
df1 = pd.concat([
    df1[["a", "b", "c", "id"]],
    df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(columns={"a_duplicate": "a", "b_duplicate": "b"}),
])
df2 = pd.concat([
    df2[["a", "c", "d", "id"]],
    df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(columns={"a_duplicate": "a", "c_duplicate": "c"}),
])

# Concatenate the resulting dataframes, reset the index, then put the columns in the correct order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3

Output:

    a   b   c   d       id
0   1   2.0 5   NaN     id1
1   3   4.0 5   NaN     id1
2   6   NaN 7   10.0    id2
3   8   NaN 9   10.0    id2

~~ For OP's Comment ~~

This is pretty hacky, but it should be able to go through all 50 of your dataframes, correct them, and combine them into a master dataframe. You will have to come up with your own way of looping through all of them (this code places them all in dataframeList and then cycles through that list). I don't know how long it will take, as I don't know how big your data is, but it's worth a shot.

data1 = {"a":1, "b":2, "a_duplicate":3,"b_duplicate":4,"c":5, "id":"id1"}
data2 = {"a":6, "c":7, "a_duplicate":8,"c_duplicate":9,"d":10, "id":"id2"}
data3 = {"a":3, "b":2, "c":7, "a_duplicate":15,"b_duplicate":20, "c_duplicate":9,"d":10, "id":"id3"}
data4 = {"a":4, "d":10, "c":5, "a_duplicate":7,"d_duplicate":15, "c_duplicate":9, "id":"id4"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
df3 = pd.DataFrame(data3, index=[0])
df4 = pd.DataFrame(data4, index=[0])
dataframeList = [df1, df2, df3, df4]

finalDF = pd.DataFrame(columns=["a", "b", "c", "d", "id"])

for df in dataframeList:
    notDup = [x for x in df.columns if "_duplicate" not in x]                   # Find column names that are not duplicated
    isDup  = list(set(df.columns)-set(notDup))                                  # Find duplicate column names
    dupColumns = isDup + list(set(notDup) - {x.split("_")[0] for x in isDup})   # Create list of column names for duplicated dataframe
    dupDF = df[dupColumns]                                                      # set the duplicate dataframe to be these columns

    for dup in isDup:                                                           # Cycle through every duplicated column name and rename it
        letter = dup.split("_")[0]                                              # to just the column name without "_duplicate"
        dupDF = dupDF.rename(columns={dup:letter})

    df = pd.concat([df[notDup], dupDF])                                         # Stack the not duplicated columns on top of the renamed duplicated columns

    finalDF = pd.concat([finalDF, df], ignore_index = True)                     # Concatenate all of them into one master dataframe

Output:

    a   b   c   d   id
0   1   2   5   NaN id1
1   3   4   5   NaN id1
2   6   NaN 7   10  id2
3   8   NaN 9   10  id2
4   3   2   7   10  id3
5   15  20  9   10  id3
6   4   NaN 5   10  id4
7   7   NaN 9   15  id4

Thanks for the solution, it works well! But I'm looking to join 50+ files within a directory. Is there any other way to do so without appending the previous df to the new df 50 times or so?
@MK2121 Check the edit. I'm working off of limited data, but if your data looks like what you've provided, this should work (or at least get you started).
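For the 50+ files case, one possible shape is to wrap the per-dataframe fix in a function, apply it to each file as it is read, and call pd.concat once at the end instead of growing a dataframe inside the loop. This is only a sketch under assumptions: the files are taken to be CSVs (the thread does not say what format they are in), and the names `split_duplicates` and `load_all` are invented for this example.

```python
import glob
import os
import tempfile

import pandas as pd

SUFFIX = "_duplicate"  # assumption: duplicated columns always end with this suffix


def split_duplicates(df, suffix=SUFFIX):
    """Move every '<name>_duplicate' column into extra rows under '<name>'."""
    dup_cols = [c for c in df.columns if c.endswith(suffix)]
    plain_cols = [c for c in df.columns if not c.endswith(suffix)]
    if not dup_cols:
        return df.copy()
    bases = [c[: -len(suffix)] for c in dup_cols]
    # The second copy holds the duplicate columns plus every plain column
    # that has no duplicate of its own (for example 'id').
    keep = [c for c in plain_cols if c not in bases]
    second = df[dup_cols + keep].rename(columns=dict(zip(dup_cols, bases)))
    return pd.concat([df[plain_cols], second], ignore_index=True)


def load_all(directory, pattern="*.csv"):
    """Read every matching file, fix it, and concatenate everything once."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    frames = [split_duplicates(pd.read_csv(p)) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demo with two throwaway CSV files built from the question's example data
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"a": [1], "b": [2], "a_duplicate": [3], "b_duplicate": [4], "c": [5], "id": ["id1"]}
    ).to_csv(os.path.join(tmp, "file1.csv"), index=False)
    pd.DataFrame(
        {"a": [6], "c": [7], "a_duplicate": [8], "c_duplicate": [9], "d": [10], "id": ["id2"]}
    ).to_csv(os.path.join(tmp, "file2.csv"), index=False)
    final_df = load_all(tmp)[["a", "b", "c", "d", "id"]]

print(final_df)
```

Collecting the fixed frames in a list and concatenating once avoids copying the accumulated data on every iteration, which tends to matter more as the number of files grows.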

Ynjxsjmh · Accepted Answer · 2022-07-26 17:26:31Z

You can try

def explode(df):
    duplicate_cols = (df.columns.str.extract('(.*)_duplicate')
                      .dropna()[0].tolist())
    unduplicate_cols = (df.columns.difference(duplicate_cols)
                        .to_series()
                        [lambda s: ~s.str.contains('_duplicate')].tolist())
    out = df.T.groupby(df.columns.str.split('_').str[0]).agg(list).T
    out = (out.explode(duplicate_cols, ignore_index=True)
           .explode(unduplicate_cols, ignore_index=True))
    return out

out = pd.concat([explode(df1), explode(df2)], ignore_index=True)

print(out)

   a    b  c   id    d
0  1    2  5  id1  NaN
1  3    4  5  id1  NaN
2  6  NaN  7  id2   10
3  8  NaN  9  id2   10
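To see what the groupby step inside explode is doing, it may help to look at the intermediate frame before the two explode calls. The sketch below repeats only that step on df1 from the question; the variable names `base` and `grouped` are chosen for this illustration. Note that grouping on the text before the first underscore assumes the base column names contain no underscores themselves.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"},
    index=[0],
)

# Key for each column: the part of its name before the first "_"
base = df1.columns.str.split("_").str[0]

# Transpose so columns become rows, gather rows sharing a key into lists,
# then transpose back so each key is a column again
grouped = df1.T.groupby(base).agg(list).T
print(grouped)
```

Each cell of `grouped` holds a list: two values for columns that had a duplicate and one value for columns that did not. That is why the answer explodes the two sets of columns separately, since a single multi-column explode requires the lists in a row to have matching lengths.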

