
I have a data frame that looks something like this: (there are about 100 more columns irrelevant to my conditional column calculation)

col1     col2     col3
a        NaN      NaN
b        NaN      NaN
NaN      a        NaN
NaN      b        NaN
NaN      NaN      a
NaN      NaN      b

I need to add a column to put those values together so that it looks like this:

col1     col2     col3     col4
a        NaN      NaN      a
b        NaN      NaN      b
NaN      a        NaN      a
NaN      b        NaN      b
NaN      NaN      a        a
NaN      NaN      b        b

I'm trying to use something like this (which has worked for other conditions, such as searching for specific strings):

df['col4'] = [x if (~pd.isnull(x)) else y if (~pd.isnull(y)) else z if (~pd.isnull(z)) else '' for x, y, z in zip(df['col1'], df['col2'], df['col3'])]

However, this only performs the first test condition and sets the rest as NaN, even if I set the else condition to set the rest as empty strings. It looks like this:

col1     col2     col3     col4
a        NaN      NaN      a
b        NaN      NaN      b
NaN      a        NaN      NaN
NaN      b        NaN      NaN
NaN      NaN      a        NaN
NaN      NaN      b        NaN

Could anyone help explain why this isn't working (and what these kinds of "functions" are called)?

Edit: to clarify, there are other columns as well, but I'm not concerned about their values in the calculation for 'col4'
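(Editor's note: the construct in the question is a list comprehension containing chained conditional expressions, also called ternary expressions. The failure comes from using `~`, the bitwise NOT operator, on plain Python bools; a minimal sketch reproducing it in isolation:)

```python
import pandas as pd

# `~` is bitwise NOT. Applied to a plain Python bool it operates on the
# underlying int, so the result is a nonzero int and therefore truthy:
print(~True)   # -2 (truthy)
print(~False)  # -1 (truthy)

# pd.isnull on a scalar returns a plain Python bool, so `~pd.isnull(x)`
# is always truthy and the first branch of the chained ternary always wins,
# which propagates col1's NaN into col4:
x = float("nan")
print(bool(~pd.isnull(x)))  # True, even though x IS null

# `not` is the correct logical negation for scalars:
print(not pd.isnull(x))     # False
```

(`~` does behave as elementwise logical NOT on boolean Series/arrays, which is why the same pattern appeared to work in vectorized code elsewhere.)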

2 Answers


Let us try bfill

df['col4']=df.bfill(1).iloc[:,0]
df
Out[107]: 
  col1 col2 col3 col4
0    a  NaN  NaN    a
1    b  NaN  NaN    b
2  NaN    a  NaN    a
3  NaN    b  NaN    b
4  NaN  NaN    a    a
5  NaN  NaN    b    b

1 Comment

This works as well, but what does the 1 argument in bfill(1) do, and why is the iloc needed?
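(Editor's note: the positional 1 is the axis argument. A sketch with a small toy frame, using the explicit keyword:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", np.nan],
                   "col2": [np.nan, np.nan, "a"]})

# axis=1 back-fills across each row (pulling values leftward), so after
# bfill every row's FIRST column holds that row's first non-null value:
filled = df.bfill(axis=1)

# iloc[:, 0] then selects that first column, positionally:
df["col4"] = filled.iloc[:, 0]
print(df["col4"].tolist())  # ['a', 'b', 'a']
```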

stack and groupby with first

df.assign(col4=df.stack().groupby(level=0).first())

  col1 col2 col3 col4
0    a  NaN  NaN    a
1    b  NaN  NaN    b
2  NaN    a  NaN    a
3  NaN    b  NaN    b
4  NaN  NaN    a    a
5  NaN  NaN    b    b

argmin and lookup

a = df.isna().to_numpy()
j = a.argmin(axis=1)
df.assign(col4=df.lookup(df.index, df.columns[j]))

  col1 col2 col3 col4
0    a  NaN  NaN    a
1    b  NaN  NaN    b
2  NaN    a  NaN    a
3  NaN    b  NaN    b
4  NaN  NaN    a    a
5  NaN  NaN    b    b
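(Editor's note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current pandas, plain NumPy fancy indexing on the underlying array does the same row/column pick; a sketch:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["a", np.nan], "col2": [np.nan, "b"]})

a = df.isna().to_numpy()
j = a.argmin(axis=1)  # column index of the first non-null in each row

# Equivalent of the removed df.lookup(df.index, df.columns[j]):
# index the raw array with (row positions, column positions) pairs.
vals = df.to_numpy()[np.arange(len(df)), j]
df = df.assign(col4=vals)
print(df["col4"].tolist())  # ['a', 'b']
```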

numpy.select

conditions = df.notna().to_numpy().T
selections = [c.to_numpy() for _, c in df.iteritems()]
df.assign(col4=np.select(conditions, selections))

  col1 col2 col3 col4
0    a  NaN  NaN    a
1    b  NaN  NaN    b
2  NaN    a  NaN    a
3  NaN    b  NaN    b
4  NaN  NaN    a    a
5  NaN  NaN    b    b
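(Editor's note: iteritems was removed in pandas 2.0; on current pandas the same numpy.select approach can be written by iterating the column labels. A sketch:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["a", np.nan], "col2": [np.nan, "b"]})

# One boolean condition array per column (transposed so rows of
# `conditions` line up with columns of df), one choice array per column:
conditions = df.notna().to_numpy().T
selections = [df[c].to_numpy() for c in df.columns]

# np.select picks, per position, the choice of the first true condition:
out = df.assign(col4=np.select(conditions, selections))
print(out["col4"].tolist())  # ['a', 'b']
```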

4 Comments

How would these work if I have about 100 other columns and I want to choose the columns that are relevant, without having to find their indices? The last one seems way too complicated...
Create a new dataframe df_new = df[columns_I_care_about]. With the first concept: df.assign(col4=df[columns_I_care_about].stack().groupby(level=0).first())
Thanks! That worked! I want to understand more of how this works. Would you mind explaining why the stack() portion is used? The groupby, I assume, horizontally collapses the columns to non-nulls based on the relevant columns, and first chooses the first instance from left to right in case there are multiple columns with non-nulls? What if, say, all three columns had non-nulls and I wanted col2 to take precedence over the others?
The elimination of the nulls via stack is coincidental. first after a groupby would've picked the first non-null value anyway. I used stack because it was syntactically convenient to do a groupby(level=0) afterwards. Otherwise I'd have to do something obnoxious like df.assign(col4=df.groupby(lambda x: 0, axis=1).first())
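(Editor's note: for the precedence question above, reordering the columns before stacking changes which value first sees first. A sketch, where the order list is a hypothetical precedence chosen for illustration:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["a", "x"],
                   "col2": [np.nan, "y"],
                   "col3": ["c", "z"]})

# stack() walks each row in the given column order and drops NaN, so
# putting col2 first makes it win whenever it is non-null:
order = ["col2", "col1", "col3"]  # hypothetical precedence
df = df.assign(col4=df[order].stack().groupby(level=0).first())
print(df["col4"].tolist())  # ['a', 'y']
```

(Row 0 falls back to col1's 'a' because its col2 is NaN; row 1 takes col2's 'y' even though col1 and col3 are also non-null.)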
