
I need to concatenate the strings in two or more columns of a pandas DataFrame.

I found this answer, which works fine as long as there are no missing values. Unfortunately, my data has some, and this leads to results like "ValueA; None", which is not really clean.

Example data:

col_A  | col_B
------ | ------
val_A  | val_B 
None   | val_B 
val_A  | None 
None   | None
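
For reference, the table above could be reproduced roughly as follows. This is only a minimal sketch: it assumes the missing entries are Python None, and the name `naive` is introduced here purely for illustration. It also shows how a plain fill-and-concatenate attempt leaves stray separators behind.

```python
import pandas as pd

# Example data from the table above; missing entries are assumed to be None
df = pd.DataFrame({
    "col_A": ["val_A", None, "val_A", None],
    "col_B": ["val_B", "val_B", None, None],
})

# Naive attempt: fill the gaps with '' and glue the columns with the separator
naive = df["col_A"].fillna("") + ";" + df["col_B"].fillna("")
print(naive.tolist())
# ['val_A;val_B', ';val_B', 'val_A;', ';']
```

Only the first row comes out as wanted; the other three keep a leading or trailing ";".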

I need this result:

col_merge
---------
val_A;val_B
val_B
val_A
None
  • Have you tried using fillna with an empty string '' on col_B? Commented Aug 31, 2017 at 8:20
  • Just did, but if there is a NaN in the first column, I get ";val_B". With NaN in both columns I just get ";". Commented Aug 31, 2017 at 8:27

1 Answer

You can use apply with if-else:

#assign the result to a new Series so that df stays a DataFrame for the code below
s = df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
print (s)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object

For a faster solution, you can use:

import numpy as np
import pandas as pd

#add the separator to each value and replace NaN with an empty string
#convert to a list of lists
arr = df.add(';').fillna('').values.tolist()
#list comprehension: join each row and strip the trailing separator,
#then replace empty strings with NaN
s = pd.Series([''.join(x).strip(';') for x in arr]).replace('^$', np.nan, regex=True)
#replace NaN with None
s = s.where(s.notnull(), None)
print (s)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object
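
Since the question asks about two or more columns, the same idea can be wrapped in a small helper. The sketch below is a variation rather than the exact code above: it filters out missing values before joining instead of stripping separators afterwards, so values that themselves start or end with the separator are left alone. The function name `join_non_null` and its parameters are hypothetical, chosen only for this example.

```python
import pandas as pd

def join_non_null(frame, sep=";"):
    """Join the non-missing values of each row with `sep`.

    Rows without any usable value yield None (hypothetical helper).
    """
    rows = frame.astype(object).values.tolist()
    # Drop missing values first, then join; an empty result becomes None
    joined = [sep.join(str(v) for v in row if pd.notna(v)) or None for row in rows]
    return pd.Series(joined, index=frame.index, dtype=object)

# Example data, assuming the missing entries are None
df = pd.DataFrame({
    "col_A": ["val_A", None, "val_A", None],
    "col_B": ["val_B", "val_B", None, None],
})

result = join_non_null(df)
print(result.tolist())
# ['val_A;val_B', 'val_B', 'val_A', None]
```

To restrict the join to certain columns, pass a subset such as `join_non_null(df[['col_A', 'col_B']])`. Note that a row whose only values are empty strings also ends up as None in this sketch.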

#40000 rows
df = pd.concat([df]*10000).reset_index(drop=True)

In [70]: %%timeit
    ...: arr = df.add('; ').fillna('').values.tolist()
    ...: s = pd.Series([''.join(x).strip('; ') for x in arr]).replace('^$', np.nan, regex=True)
    ...: s.where(s.notnull(), None)
    ...: 
10 loops, best of 3: 74 ms per loop


In [71]: %%timeit
    ...: df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
    ...: 
1 loop, best of 3: 12.7 s per loop

#another solution, but a bit slower
In [72]: %%timeit
     ...: arr = df.add('; ').fillna('').values  
     ...: s = [''.join(x).strip('; ') for x in arr]
     ...: pd.Series([y if y != '' else None for y in s])
     ...: 
     ...: 
10 loops, best of 3: 119 ms per loop

2 Comments

nice answer to a surprisingly hard problem. .join() and .cat() both surprisingly fail here
I'm not sure why, but all solutions I found, including the top solution always gave me a separator, like the "; ", if col_B was empty. The 'faster' solution was the only way I found to concatenate values with a separator and handle np.nans in the second column. Thank you Mr Jazrael once again!
