I have the following dataframe that lists information on fast food road stops.

Input

first_stop   second_stop     third_stop
mcdonalds    burger king     popeyes
mcdonalds    N/A             N/A
wendys       kfc             N/A
taco bell    kfc             wendys
popeyes      kfc             panda express

I want to create a new column, summary, that summarizes the stops like so:

Expected Output

first_stop   second_stop     third_stop        summary
mcdonalds    burger king     popeyes           mcdonalds -> burger king -> popeyes 
mcdonalds    N/A             N/A               mcdonalds
wendys       kfc             N/A               wendys -> kfc
taco bell    kfc             wendys            taco bell -> kfc -> wendys
popeyes      kfc             panda express     popeyes -> kfc -> panda express

I cannot simply concatenate the three stop columns because some contain N/A values where the stop did not exist. How can I do this in pandas?

I've tried this, but it doesn't give me what I want:

df['summary'] = df['first_stop'] + '->' + df['second_stop'] + '->' + df['third_stop']

2 Answers

1

Use stack to flatten your dataframe into one long Series. stack drops NaN values by default, so you can then group by index level 0 (the original row labels) and join the remaining strings.

df['summary'] = df.stack().groupby(level=0).apply(lambda x: ' -> '.join(x))
print(df)

# Output
  first_stop  second_stop     third_stop                              summary
0  mcdonalds  burger king        popeyes  mcdonalds -> burger king -> popeyes
1  mcdonalds          NaN            NaN                            mcdonalds
2     wendys          kfc            NaN                        wendys -> kfc
3  taco bell          kfc         wendys           taco bell -> kfc -> wendys
4    popeyes          kfc  panda express      popeyes -> kfc -> panda express
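
For a self-contained version of the stack approach, here is a minimal sketch. It assumes the missing stops are real NaN values; if they are the literal string "N/A", something like df.replace("N/A", np.nan) would be needed first. The stop_cols name and the explicit dropna() are additions for illustration: dropna() is there so the result does not depend on whether stack drops NaN by default, and selecting stop_cols keeps an existing summary column out of the join.

```python
import numpy as np
import pandas as pd

# Small sample frame; missing stops are assumed to be real NaN values.
df = pd.DataFrame({
    "first_stop": ["mcdonalds", "mcdonalds", "wendys"],
    "second_stop": ["burger king", np.nan, "kfc"],
    "third_stop": ["popeyes", np.nan, np.nan],
})

# Hypothetical helper name: the columns that hold the stops.
stop_cols = ["first_stop", "second_stop", "third_stop"]

# stack() turns the frame into one long Series indexed by (row, column).
# dropna() makes the NaN removal explicit, then the values are joined per row.
df["summary"] = (
    df[stop_cols]
    .stack()
    .dropna()
    .groupby(level=0)
    .agg(" -> ".join)
)

print(df)
```

Rows in which every stop is missing have no group at all, so they would end up with NaN in summary rather than an empty string.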

0

If you have a large dataset, you could use a plain row-wise loop with apply, which can be faster than stack + groupby:

df['summary'] = df.apply(lambda s: ' -> '.join(e for e in s if not pd.isna(e)),
                         axis=1)

Or, to stop at the first NA:

from itertools import takewhile
df['summary'] = df.apply(lambda s: ' -> '.join(
                         takewhile(lambda x: not pd.isna(x), s)
                         ), axis=1)

Or, for this particular ' -> ' separator where the characters are not expected to be found in the words:

df['summary'] = df.fillna('').apply(' -> '.join, axis=1).str.rstrip('>- ')

NB: this is a trick. It does not work with every separator, and it would also strip any trailing '>', '-' or space characters from the last stop name.

output:

  first_stop  second_stop     third_stop                              summary
0  mcdonalds  burger king        popeyes  mcdonalds -> burger king -> popeyes
1  mcdonalds          NaN            NaN                            mcdonalds
2     wendys          kfc            NaN                        wendys -> kfc
3  taco bell          kfc         wendys           taco bell -> kfc -> wendys
4    popeyes          kfc  panda express      popeyes -> kfc -> panda express
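
The three variants above give the same result on the question's data, but they differ once a row has a gap in the middle. The sketch below uses a hypothetical row ("taco bell", NaN, "wendys") that is not in the original data, and the function names are made up for illustration.

```python
from itertools import takewhile

import numpy as np
import pandas as pd

# Hypothetical data: the first row has a missing stop in the middle.
df = pd.DataFrame({
    "first_stop": ["taco bell", "wendys"],
    "second_stop": [np.nan, "kfc"],
    "third_stop": ["wendys", np.nan],
})


def join_skipping_na(row, sep=" -> "):
    """Join every non-missing value in the row."""
    return sep.join(v for v in row if not pd.isna(v))


def join_until_first_na(row, sep=" -> "):
    """Join values only up to the first missing one."""
    return sep.join(takewhile(lambda v: not pd.isna(v), row))


skipped = df.apply(join_skipping_na, axis=1)
truncated = df.apply(join_until_first_na, axis=1)
# The rstrip trick only cleans up separators at the end of the string.
trick = df.fillna("").apply(" -> ".join, axis=1).str.rstrip(">- ")

print(skipped.tolist())    # ['taco bell -> wendys', 'wendys -> kfc']
print(truncated.tolist())  # ['taco bell', 'wendys -> kfc']
print(trick.tolist())      # ['taco bell ->  -> wendys', 'wendys -> kfc']
```

Which behavior is right depends on whether a gap in the middle of a row is possible and what it should mean; the rstrip trick is only safe when missing values can appear at the end of a row.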

3 Comments

The classical loop isn't faster. I have a sparse dataset where only 90k rows were filled in, and the stack & groupby approach is roughly 25% faster (using %%timeit in a notebook).
@Sam thanks for the feedback. Timings are tricky: they depend on the version of the library and the exact dataset. On the current example, apply runs in 185 µs ± 3.7 µs vs 826 µs ± 14.6 µs for stack + groupby.apply. On 50k rows, this is 226 ms ± 28.5 ms vs 1.49 s ± 9.7 ms. That's why it's always important to test on the real dataset. In your case, stack gets rid of most of the values because you have many NaNs. You can improve the loop approach by working only on the rows that are not entirely NaN (df[df.notna().any(axis=1)].apply(…)), which should be faster ;)
Thanks for the meaningful insights! Honestly I was hoping to counter this issue with df[col].str.cat(others=df[other_col], sep="->") but the default arguments don't cover this case.
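
The row filter suggested in the comments could look roughly like the sketch below. The sample data, the stop_cols list and the has_stop mask are assumptions for illustration, and whether this is actually faster on a given dataset still needs to be timed.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse data: rows 1 and 3 have no stops at all.
df = pd.DataFrame({
    "first_stop": ["mcdonalds", np.nan, "wendys", np.nan],
    "second_stop": [np.nan, np.nan, "kfc", np.nan],
    "third_stop": [np.nan, np.nan, "popeyes", np.nan],
})
stop_cols = ["first_stop", "second_stop", "third_stop"]

# Boolean mask: True for rows with at least one non-missing stop.
has_stop = df[stop_cols].notna().any(axis=1)

# Run the row-wise join only on those rows; the assignment aligns on the
# index, so the skipped rows get NaN in the new column.
df["summary"] = df.loc[has_stop, stop_cols].apply(
    lambda row: " -> ".join(row.dropna()), axis=1
)

print(df)
```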
