
I'm having problems (sort of) with combining duplicate columns. It seems to work on older versions of Pandas/Python (not sure what the culprit is here), but not on the latest version.

I have a dataframe of mixed values with duplicate column names after a concat. Each value is an int, a string, or nan. For each duplicate column name, the non-nan values in a given row are the same, so in theory max() should do the trick.

Say I have the Dataframe:

    col1  col1  col2  col2  col3
0   Foo   nan   nan   Bar   Baz
1   nan   nan   Bar   Bar   nan
2   0     nan   1     nan   1
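
For anyone who wants to reproduce this, here is a minimal sketch that builds the same frame directly. The real frame came out of a concat; the literal `data` list is only an assumption made to match the table above.

```python
import numpy as np
import pandas as pd

# Rows copied from the table in the question; nan marks a missing value.
data = [
    ["Foo", np.nan, np.nan, "Bar", "Baz"],
    [np.nan, np.nan, "Bar", "Bar", np.nan],
    [0, np.nan, 1, np.nan, 1],
]

# Duplicate column names are allowed when passed explicitly.
df = pd.DataFrame(data, columns=["col1", "col1", "col2", "col2", "col3"])
```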

My goal is to get

    col1  col2  col3
0   Foo   Bar   Baz
1   nan   Bar   nan
2   0     1     1

Doing this

df.groupby(df.columns,axis=1).max()

This does exactly what I want on an older version of Pandas/Python, but it does not work on the latest. This is what I get on the latest version:

    col1  col2  col3
0   nan   nan   Baz
1   nan   nan   nan
2   0     1     1

Any ideas?

  • What is the logic of max here? max of what Commented Jul 26, 2018 at 1:51
  • Because the values in each column are either duplicates or nan. I'll edit that in. Commented Jul 26, 2018 at 1:58
  • df.groupby(df.columns, axis=1).first() Commented Jul 26, 2018 at 1:59

3 Answers


I think you need to transpose the dataframe first, reset the index, rename the duplicated index values, and then use groupby.

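# Transpose so the column names become a regular "index" column.
df_t = df.T.reset_index()
# Keep only the part of each name before the first "." (e.g. "col1.1" -> "col1").
df_t["index"] = df_t["index"].str.split(".").str[0]
# first() takes the first non-null value per group; transpose back afterwards.
result = df_t.groupby("index").first().T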

Output

Out[57]: 
index col1 col2 col3
0      Foo  Bar  Baz
1      NaN  Bar  NaN
2        0    1    1
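
When the column names are already exact duplicates, as in the question, the renaming step may not be needed. Below is a minimal sketch of the same transpose-then-group idea that groups on the index level directly. The frame is rebuilt from the question's table so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Example frame from the question, with truly duplicated column names.
df = pd.DataFrame(
    [
        ["Foo", np.nan, np.nan, "Bar", "Baz"],
        [np.nan, np.nan, "Bar", "Bar", np.nan],
        [0, np.nan, 1, np.nan, 1],
    ],
    columns=["col1", "col1", "col2", "col2", "col3"],
)

# After transposing, the duplicated names form the row index, so grouping
# on index level 0 collects the same-named columns into one group.
# first() returns the first non-null entry of each group.
result = df.T.groupby(level=0).first().T
```

This takes the first non-null value rather than a true maximum, which matches the question only because the non-nan duplicates are stated to be identical.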

2 Comments

He doesn't want the first value, but the max value of each row
@RafaelC I think max here means first, according to his explanation. Maybe I'm wrong, but I have no idea how to get the max of a string and an integer.

Guess the problem arises when you try to compare strings with np.nan
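
To illustrate the comparison problem without pandas, here is a small plain-Python sketch. On Python 3, ordering a str against a float (nan is a float) or an int raises TypeError, whereas Python 2 allowed such mixed comparisons, which may be why older setups behaved differently. `try_max` is a hypothetical helper written only for this demonstration.

```python
nan = float("nan")

def try_max(values):
    """Return max(values), or the error text if the items cannot be ordered."""
    try:
        return max(values)
    except TypeError as exc:
        return f"TypeError: {exc}"

# A string next to nan cannot be ordered on Python 3.
str_vs_nan = try_max(["Foo", nan])

# An empty string next to an int fails the same way.
empty_vs_int = try_max(["", 0])

# Values of the same type compare fine.
same_type = try_max(["Bar", "Baz"])
```

The second case is worth keeping in mind for any approach that replaces nan with a placeholder string in rows that also hold ints.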

A workaround would be to use an empty string instead of np.nan:

df.fillna('').groupby(df.columns, axis=1).max()

    bar baz foo
0   Bar Baz Foo
1   Bar     
2   1   1   0

You can convert back to np.nan afterwards if needed:

.replace('', np.nan)

    bar baz foo
0   Bar Baz Foo
1   Bar NaN NaN
2   1   1   0

Edit

If you don't want to use a workaround, or if '' might be present in your data frame, you can define your own max function and use it to aggregate:

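import numpy as np
import pandas as pd

def mmax(s):
    # Drop nulls first so max() only sees comparable values.
    s = [z for z in s if not pd.isnull(z)]
    if not s:
        return np.nan
    return max(s)

def a(s):
    # Row-wise max within one group of same-named columns.
    return s.agg(mmax, axis=1)

df.groupby(df.columns, axis=1).agg(a)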

Outputs

    bar baz foo
0   Bar Baz Foo
1   Bar NaN NaN
2   1   1   0
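
Recent pandas releases deprecate the `axis=1` argument of `DataFrame.groupby`, so here is a hedged sketch of an alternative that avoids groupby altogether. It swaps max for "first non-null value", which is equivalent only under the question's assumption that non-nan duplicates agree. `first_valid` and `combine_duplicate_columns` are hypothetical helper names, not part of pandas.

```python
import numpy as np
import pandas as pd

def first_valid(values):
    """Return the first non-null item in values, or nan if there is none."""
    for v in values:
        if not pd.isna(v):
            return v
    return np.nan

def combine_duplicate_columns(df):
    """Collapse same-named columns into one column per unique name."""
    out = {}
    for name in df.columns.unique():
        # Boolean mask selects every column carrying this name.
        block = df.loc[:, df.columns == name]
        # Walk the rows of that block and keep the first non-null value.
        out[name] = [first_valid(row) for row in block.to_numpy(dtype=object)]
    return pd.DataFrame(out, index=df.index)

# Example frame from the question.
df = pd.DataFrame(
    [
        ["Foo", np.nan, np.nan, "Bar", "Baz"],
        [np.nan, np.nan, "Bar", "Bar", np.nan],
        [0, np.nan, 1, np.nan, 1],
    ],
    columns=["col1", "col1", "col2", "col2", "col3"],
)

result = combine_duplicate_columns(df)
```

Because no values are ever compared with each other, this does not hit the str-versus-number ordering problem.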

2 Comments

df.fillna('').groupby(df.columns, axis=1).max() almost works, but it seems to do weird things when comparing '' with an int. I did some extra preprocessing to get around that, and it seems to work.
@jaykayrowling did you check the edit? :) You can put whatever key you want in the mmax function, which makes it very flexible. From your example, it seemed that a row would contain only ints (or strs) and np.nan.

Your algorithm is quite a good one. Can you try:

df.groupby(df.columns,axis=1).max(axis=1)

1 Comment

I tried it, but it failed because nan is returned when comparing nan with strings. So I tried df.fillna('').groupby(df.columns,axis=1).max(axis=1), as @RafaelC posted, and that works fine.
