Concatenating string columns but only if not N/A in pandas

Question

I have the following dataframe that lists information on fast food road stops.

Input

first_stop   second_stop     third_stop
mcdonalds    burger king     popeyes
mcdonalds    N/A             N/A
wendys       kfc             N/A
taco bell    kfc             wendys
popeyes      kfc             panda express

I want to create a new column, summary, that summarizes the stops like so:

Expected Output

first_stop   second_stop     third_stop        summary
mcdonalds    burger king     popeyes           mcdonalds -> burger king -> popeyes 
mcdonalds    N/A             N/A               mcdonalds
wendys       kfc             N/A               wendys -> kfc
taco bell    kfc             wendys            taco bell -> kfc -> wendys
popeyes      kfc             panda express     popeyes -> kfc -> panda express

I cannot simply concatenate the three stop columns because some contain N/A values where the stop did not exist. How can I do this in pandas?

I've tried this, but it doesn't give me what I want, because any row containing an N/A ends up with an N/A summary:

df['summary'] = df['first_stop'] + '->' + df['second_stop'] + '->' + df['third_stop']

Corralien · Accepted Answer · 2022-01-26 15:33:59Z

1

Use stack to flatten your dataframe. stack drops NaN values by default, so you can then group by index level 0 and join the strings.

df['summary'] = df.stack().groupby(level=0).apply(lambda x: ' -> '.join(x))
print(df)

# Output
  first_stop  second_stop     third_stop                              summary
0  mcdonalds  burger king        popeyes  mcdonalds -> burger king -> popeyes
1  mcdonalds          NaN            NaN                            mcdonalds
2     wendys          kfc            NaN                        wendys -> kfc
3  taco bell          kfc         wendys           taco bell -> kfc -> wendys
4    popeyes          kfc  panda express      popeyes -> kfc -> panda express
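
For readers who want to run this end to end, here is a minimal self-contained sketch of the same stack + groupby idea. It assumes the missing stops are real NaN values rather than the literal string "N/A", and it adds an explicit dropna() as a precaution in case stack does not drop NaN in your pandas version; the stop_cols name is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with missing stops stored as NaN (assumption).
df = pd.DataFrame({
    "first_stop": ["mcdonalds", "mcdonalds", "wendys", "taco bell", "popeyes"],
    "second_stop": ["burger king", np.nan, "kfc", "kfc", "kfc"],
    "third_stop": ["popeyes", np.nan, np.nan, "wendys", "panda express"],
})

# Restrict to the stop columns so re-running does not pick up 'summary' itself.
stop_cols = ["first_stop", "second_stop", "third_stop"]

# Long format: one entry per (row, column) pair; NaN entries removed explicitly.
long_stops = df[stop_cols].stack().dropna()

# Join the remaining stops of each original row (index level 0).
df["summary"] = long_stops.groupby(level=0).agg(" -> ".join)
print(df)
```

If the data holds the text "N/A" instead of NaN, something like df.replace("N/A", np.nan) beforehand should bring it into this shape. Note that a row whose stops are all missing disappears from the stacked data, so its summary comes back as NaN rather than an empty string.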


mozway · Accepted Answer · 2022-01-26 16:16:15Z

0

If you have a large dataset, you could use a classical loop, which can be faster than stack+groupby depending on the data (see the timings in the comments):

df['summary'] = df.apply(lambda s: ' -> '.join(e for e in s if not pd.isna(e)),
                         axis=1)

Or, to stop at the first NA:

from itertools import takewhile
df['summary'] = df.apply(lambda s: ' -> '.join(
                         takewhile(lambda x: not pd.isna(x), s)
                         ), axis=1)

Or, for this particular ' -> ' separator where the characters are not expected to be found in the words:

df['summary'] = df.fillna('').apply(' -> '.join, axis=1).str.rstrip('>- ')

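NB: this is a trick and does not work with every separator. It relies on rstrip removing any trailing '>', '-' and space characters, so it only handles NaNs at the end of a row, and it would also trim those characters from the end of the last stop name.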

output:

  first_stop  second_stop     third_stop                              summary
0  mcdonalds  burger king        popeyes  mcdonalds -> burger king -> popeyes
1  mcdonalds          NaN            NaN                            mcdonalds
2     wendys          kfc            NaN                        wendys -> kfc
3  taco bell          kfc         wendys           taco bell -> kfc -> wendys
4    popeyes          kfc  panda express      popeyes -> kfc -> panda express
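
The three variants give the same result on the question's data, but they differ once a NaN sits between two values. The sketch below uses a small made-up frame (the stops data and the helper names join_skipping_na and join_until_na are hypothetical, chosen only for illustration) to show the difference.

```python
from itertools import takewhile

import numpy as np
import pandas as pd

# Hypothetical data: row 0 has a gap in the middle, row 1 has a trailing NaN.
stops = pd.DataFrame({
    "first_stop": ["wendys", "taco bell"],
    "second_stop": [np.nan, "kfc"],
    "third_stop": ["kfc", np.nan],
})

def join_skipping_na(row):
    # Keep every non-missing value, wherever it appears in the row.
    return " -> ".join(v for v in row if not pd.isna(v))

def join_until_na(row):
    # Keep values only up to (not including) the first missing one.
    return " -> ".join(takewhile(pd.notna, row))

skipped = stops.apply(join_skipping_na, axis=1)
until_first = stops.apply(join_until_na, axis=1)
# The fillna/rstrip trick only cleans up separators at the end of the string.
trick = stops.fillna("").apply(" -> ".join, axis=1).str.rstrip(">- ")

print(skipped.tolist())      # ['wendys -> kfc', 'taco bell -> kfc']
print(until_first.tolist())  # ['wendys', 'taco bell -> kfc']
print(trick.tolist())        # ['wendys ->  -> kfc', 'taco bell -> kfc']
```

Which behaviour is right depends on whether a gap in the middle of a row is meaningful in your data; in the question's data, NaNs only appear at the end, so all three agree.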


3 Comments

Sam Over a year ago

The classical loop isn't faster. I have a sparse dataset where only 90k rows were filled in and the stack & groupby approach is +/- 25% faster (using %%timeit in a notebook)

mozway Over a year ago

@Sam thanks for the feedback. Timings are tricky, they depend on the version of the library and the exact dataset. On the current example apply runs in 185 µs ± 3.7 µs vs 826 µs ± 14.6 µs for the stack+groupby.apply. On 50k rows, this is 226 ms ± 28.5 ms vs 1.49 s ± 9.7 ms. That's why it's always important to test on the real dataset. In your case stack will get rid of most of the values as you have many NaNs, in this case you can improve the loop approach by only working on the rows that don't have only NaNs (df[df.notna().any(axis=1)].apply(…)), which should be faster ;)
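
A minimal sketch of the row-filtering idea from the comment above, assuming a sparse frame where many rows are entirely NaN. The sample data and the names stop_cols and has_any_stop are illustrative only; whether this is actually faster should be measured on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse data: the middle row has no stops at all.
df = pd.DataFrame({
    "first_stop": ["mcdonalds", np.nan, "wendys"],
    "second_stop": [np.nan, np.nan, "kfc"],
    "third_stop": [np.nan, np.nan, np.nan],
})
stop_cols = ["first_stop", "second_stop", "third_stop"]

# Boolean mask of rows that contain at least one non-missing stop.
has_any_stop = df[stop_cols].notna().any(axis=1)

# Run the Python-level loop only on those rows; assignment aligns on the
# index, so the skipped rows receive NaN in 'summary'.
df["summary"] = df.loc[has_any_stop, stop_cols].apply(
    lambda row: " -> ".join(v for v in row if not pd.isna(v)),
    axis=1,
)
print(df)
```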

Sam Over a year ago

Thanks for the meaningful insights! Honestly I was hoping to counter this issue with df[col].str.cat(others=df[other_col], sep="->") but the default arguments don't cover this case.
