Pandas combine duplicate columns that contain strings

Question

I'm having problems (sort of) with combining duplicate columns. It seems to work on older versions of Pandas/Python (not sure what the culprit is here), but not on the latest version.

I basically have a dataframe of mixed values with duplicate column names after a concat. Each value is an int, a string, or nan. All non-nan values are the same across each set of duplicate columns, so in theory max() should do the trick.

Say I have the Dataframe:

    col1  col1  col2  col2  col3
0   Foo   nan   nan   Bar   Baz
1   nan   nan   Bar   Bar   nan
2   0     nan   1     nan   1

My goal is to get

    col1  col2  col3
0   Foo   Bar   Baz
1   nan   Bar   nan
2   0     1     1

Doing this

df.groupby(df.columns,axis=1).max()

Does exactly what I want on an older version of Pandas/Python, but does not work on the latest one. This is what I get on the latest version:

    col1  col2  col3
0   nan   nan   Baz
1   nan   nan   nan
2   0     1     1

Any ideas?

Because the values in each column are either duplicates or nan. I'll edit that in.
– jaykayrowling, Commented Jul 26, 2018 at 1:58

Lambda · Accepted Answer · 2018-07-26 02:26:18Z

1

I think you need to transpose the dataframe first, reset the index, rename the duplicated index values so they match, and finally use groupby.

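# Transpose so the duplicated column names become values in an "index" column
df_t = df.T.reset_index()
# Strip any ".1"-style suffix so duplicates share exactly the same name
df_t["index"] = df_t["index"].str.split(".").str[0]
# Take the first non-null value per name, then transpose back
result = df_t.groupby("index").first().T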

output

Out[57]: 
index col1 col2 col3
0      Foo  Bar  Baz
1      NaN  Bar  NaN
2        0    1    1
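If the duplicated labels are exactly identical (as in the question) rather than suffixed like "col1.1", the split step is not needed. Below is a minimal sketch of the same transpose idea under that assumption. The frame is rebuilt from the question's example, and it relies on first() skipping missing values by default.

```python
import numpy as np
import pandas as pd

# Example frame from the question, with exact duplicate column labels
df = pd.DataFrame(
    [
        ["Foo", np.nan, np.nan, "Bar", "Baz"],
        [np.nan, np.nan, "Bar", "Bar", np.nan],
        [0, np.nan, 1, np.nan, 1],
    ],
    columns=["col1", "col1", "col2", "col2", "col3"],
)

# After transposing, the duplicated column labels are duplicated index labels,
# so grouping on index level 0 puts each set of duplicates in one group.
# first() keeps the first non-null entry per group; .T restores the layout.
result = df.T.groupby(level=0).first().T
print(result)
```

This takes the first non-null value rather than a true max, which is only equivalent when all non-null duplicates agree, as the question states.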

answered Jul 26, 2018 at 2:26

Lambda


2 Comments

rafaelc Over a year ago

He doesn't want the first value, but the max value of each row

Lambda Over a year ago

@RafaelC I think max here means first, according to his explanation. Maybe I'm wrong, but I have no idea how to get the max of a string and an integer.

Community · Accepted Answer · 2020-06-20 09:12:55Z

0

I guess the problem arises when you try to compare strings with np.nan.

A workaround would be to use empty strings instead of np.nan:

df.fillna('').groupby(df.columns, axis=1).max()

    bar baz foo
0   Bar Baz Foo
1   Bar     
2   1   1   0

You can go back to np.nan afterwards if needed:

.replace('', np.nan)

    bar baz foo
0   Bar Baz Foo
1   Bar NaN NaN
2   1   1   0
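The comparison issue behind this workaround can be seen in plain Python, without pandas. The sketch below uses a hypothetical helper, try_max, that just reports whether max() can order the values. How pandas reacts to such a failure internally may differ between versions, so treat this only as an illustration of the underlying behaviour.

```python
import numpy as np

def try_max(values):
    """Return max(values), or a message if the values cannot be ordered."""
    try:
        return max(values)
    except TypeError as exc:
        return f"TypeError: {exc}"

# A str and a float (np.nan is a float) have no defined ordering
print(try_max(["Foo", np.nan]))

# Replacing nan with '' helps for string columns...
print(try_max(["", "Bar"]))

# ...but '' and an int cannot be ordered either
print(try_max(["", 0]))
```

The last case suggests the empty-string trick is only safe when every duplicate group holds strings.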

`edit`

If you don't want to use a workaround, or if '' might be present in your data frame, you can define your own max function and use it to aggregate

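import numpy as np
import pandas as pd

def mmax(s):
    # Drop missing values, then take the max of whatever is left
    s = [z for z in s if not pd.isnull(z)]
    if not len(s): return np.nan
    return max(s)

def a(s):
    # s holds the columns that share one name; reduce them row by row
    return s.agg(mmax, axis=1)

df.groupby(df.columns, axis=1).agg(a)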

Outputs

    bar baz foo
0   Bar Baz Foo
1   Bar NaN NaN
2   1   1   0
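Recent pandas releases deprecate the axis=1 argument of groupby, so here is a minimal sketch of the same null-skipping max without it. It assumes the question's example frame and loops over the unique column names. The variable name combined is only illustrative.

```python
import numpy as np
import pandas as pd

def mmax(values):
    """Max of the non-null values, or nan if there are none."""
    present = [v for v in values if not pd.isnull(v)]
    return max(present) if present else np.nan

# Example frame from the question, with duplicate column labels
df = pd.DataFrame(
    [
        ["Foo", np.nan, np.nan, "Bar", "Baz"],
        [np.nan, np.nan, "Bar", "Bar", np.nan],
        [0, np.nan, 1, np.nan, 1],
    ],
    columns=["col1", "col1", "col2", "col2", "col3"],
)

# For each distinct name, select every column carrying that name
# and collapse the block to one column by applying mmax to each row.
combined = pd.DataFrame(
    {
        name: df.loc[:, df.columns == name].apply(mmax, axis=1)
        for name in df.columns.unique()
    }
)
print(combined)
```

Because mmax only ever compares non-null values, it avoids the str-versus-nan comparison, though it would still fail if one row mixed a string and an int under the same name.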

edited Jun 20, 2020 at 9:12

CommunityBot


answered Jul 26, 2018 at 2:24

rafaelc


2 Comments

jaykayrowling Over a year ago

df.fillna('').groupby(df.columns, axis=1).max() almost works, but seems to do weird things when comparing '' with an int. I did some extra preprocessing to get around that, and it seems to work now.

rafaelc Over a year ago

@jaykayrowling did you check the edit? :) You can put whatever key you want in the mmax func, which makes it very flexible. From your example, it seemed that a row would have only int (or str) and np.nan.

Woods Chen · Accepted Answer · 2018-07-26 02:14:30Z

0

Your algorithm is quite a good one. Can you try:

df.groupby(df.columns,axis=1).max(axis=1)

answered Jul 26, 2018 at 2:14

Woods Chen


1 Comment

Woods Chen Over a year ago

I tried it and it failed, because nan is returned when comparing nan with strings. So I tried df.fillna('').groupby(df.columns,axis=1).max(axis=1) as @RafaelC posted, and that works fine.
