1

I have a DF with about 50 columns. 5 of them contain strings that I want to combine into a single column, separating the strings with commas but also keeping the spaces within each of the strings. Moreover, some values are missing (NaN). The last requirement would be to remove duplicates if they exist.

So I have something like this in my DF:

| symptom_1 | symptom_2 | symptom_3 | symptom_4 | symptom 5 |
|---|---|---|---|---|
| muscle pain | super headache | diarrhea | Sore throat | Fatigue |
| super rash | ulcera | super headache | | |
| diarrhea | super diarrhea | | | |
| something awful | something awful | | | |

And I need something like this:

| symptom_1 | symptom_2 | symptom_3 | symptom_4 | symptom 5 | all_symptoms |
|---|---|---|---|---|---|
| muscle pain | super headache | diarrhea | Sore throat | Fatigue | muscle pain, super headache, diarrhea, Sore throat, Fatigue |
| super rash | ulcera | super headache | | | super rash, ulcera, headache |
| diarrhea | super diarrhea | | | | diarrhea, super diarrhea |
| something awful | something awful | | | | something awful |

I wrote the following function and while it merges all the columns it does not respect the spaces within the original strings, which is a must.

def merge_columns_into_one(DataFrame, columns_to_combine, new_col_name, drop_originals = False):
    DataFrame[new_col_name] = DataFrame[columns_to_combine].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return DataFrame

Thanks in advance for your help!

Edit: while I'm writing this question, the second markdown table appears just fine in the preview, but as soon as I post it the table loses its format. I hope you get the idea of what I'm trying to do. Otherwise, I'd appreciate your feedback on how to fix the MD table.

3
  • Can you reformat the output you're looking for? Commented Apr 8, 2021 at 2:36
  • Yeah, I've been trying, it looks just fine in the preview but as soon as I post the question it loses the MD table format. Do you have any suggestions or workarounds for this? Perhaps I can post an image? Commented Apr 8, 2021 at 2:40
  • Seems that it magically fixed itself :) Commented Apr 8, 2021 at 2:41

5 Answers

2

Just use the fillna(), apply() and rstrip() methods (assuming df holds only the symptom columns):

df['all_symptoms'] = df.fillna('').apply(pd.unique, axis=1).apply(','.join).str.rstrip(',')

Now if you print df you will get your desired output:

| symptom_1 | symptom_2 | symptom_3 | symptom_4 | symptom 5 | all_symptoms |
|---|---|---|---|---|---|
| muscle pain | super headache | diarrhea | Sore throat | Fatigue | muscle pain, super headache, diarrhea, Sore throat, Fatigue |
| super rash | ulcera | super headache | | | super rash, ulcera, headache |
| diarrhea | super diarrhea | | | | diarrhea, super diarrhea |
| something awful | something awful | | | | something awful |
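A minimal sketch of the same idea restricted to the five symptom columns, for the case where the DataFrame has many other columns. The sample frame and the column names in `cols` are assumptions based on the question (adjust them to the real names). It drops the missing values per row instead of filling them with empty strings, so a gap in the middle of a row does not leave a stray comma, and it joins with ", " to match the requested output.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question (assumed column names).
df = pd.DataFrame({
    "symptom_1": ["muscle pain", "super rash", "diarrhea", "something awful"],
    "symptom_2": ["super headache", "ulcera", "super diarrhea", "something awful"],
    "symptom_3": ["diarrhea", "super headache", np.nan, np.nan],
    "symptom_4": ["Sore throat", np.nan, np.nan, np.nan],
    "symptom_5": ["Fatigue", np.nan, np.nan, np.nan],
})

cols = ["symptom_1", "symptom_2", "symptom_3", "symptom_4", "symptom_5"]

# For each row: drop NaN, keep the first occurrence of each value, join.
df["all_symptoms"] = df[cols].apply(
    lambda row: ", ".join(row.dropna().unique()), axis=1
)

print(df["all_symptoms"].tolist())
```

Spaces inside each symptom are untouched because the values are only joined, never split.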

1 Comment

Thank you, it was a really clean and clear one-liner solution.
2

You can use pandas str.cat, with some massaging:

(df
 .fillna("")
 .assign(all_symptoms = lambda df: df.iloc[:, 0]
                                    .str.cat(df.iloc[:, 1:], 
                                             sep=',')
                                    .str.strip(",")
                                    .str.split(",")
                                    .map(pd.unique)
                                    .str.join(","))
         )

         symptom_1        symptom_2       symptom_3    symptom_4 symptom 5                                       all_symptoms
0      muscle pain   super headache        diarrhea  Sore throat   Fatigue  muscle pain,super headache,diarrhea,Sore throa...
1       super rash           ulcera  super headache                                          super rash,ulcera,super headache
2         diarrhea   super diarrhea                                                                   diarrhea,super diarrhea
3  something awful  something awful                                                                           something awful

Alternatively, you could run the string operations in plain Python, which is usually faster than the pandas string methods (they are wrappers around Python's string methods anyway):

df = df.fillna("")

_, strings = zip(*df.items())

strings = zip(*strings)

strings = map(pd.unique, strings)

strings = map(",".join, strings)

df['all_symptoms'] = [entry.strip(",") for entry in strings]

3 Comments

Any speed improvement is really welcome, so I tried it, but to no avail:

columns = ['symptom_1', 'symptom_2', 'symptom_3', 'symptom_4', 'symptom_5']
data[columns].fillna('')
_, strings = zip(*data_3[columns].items())
strings = zip(*strings)
strings = map(pd.unique, strings)
strings = map(",".join, strings)
## debug print
print(data[columns].info())
##
data['all_symptoms'] = [entry.strip(",") for entry in strings]

Error:

data['all_symptoms'] = [entry.strip(",") for entry in strings]
TypeError: sequence item 3: expected str instance, float found

I don't know why the formatting came out so badly in my previous comment. But I double-checked, and my columns only contain strings:

print(data[columns].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   symptom_1  4 non-null      object
 1   symptom_2  4 non-null      object
 2   symptom_3  2 non-null      object
 3   symptom_4  1 non-null      object
 4   symptom_5  1 non-null      object
dtypes: object(5)

not sure why it is seeing a float; might require more investigation; glad though that you got your challenge resolved with the chosen answer
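One likely cause, judging only from the snippet pasted in the comment: `data[columns].fillna('')` returns a new DataFrame and its result is never assigned, so the NaN values (which are floats) are still present when `",".join` runs. The non-null counts of 2, 1 and 1 in the `info()` output point the same way. A minimal sketch with a made-up two-column frame; it swaps in `dict.fromkeys` for the order-preserving de-duplication so it does not depend on how `pd.unique` treats plain tuples.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column sample with one missing value.
data = pd.DataFrame({
    "symptom_1": ["diarrhea", "super rash"],
    "symptom_2": ["super diarrhea", np.nan],
})
columns = ["symptom_1", "symptom_2"]

data[columns].fillna("")  # returns a new frame; `data` itself is unchanged
still_missing = data[columns].isna().any().any()  # NaN is still there

filled = data[columns].fillna("")  # keep the result this time

# Walk the rows as plain tuples, de-duplicate in order, skip empty strings.
rows = zip(*(filled[c] for c in columns))
data["all_symptoms"] = [
    ",".join(s for s in dict.fromkeys(row) if s) for row in rows
]

print(data["all_symptoms"].tolist())
```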
0

Here is an example to give you an idea of how to work around it:

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

# Element-wise string concatenation with a comma as separator
df1["merged"] = df1["A"] + "," + df1["B"] + "," + df1["C"] + "," + df1["D"]


0

First, apply a lambda function to each row, then use set() to collect the symptoms without repetitions, and finally filter out the missing values with a list comprehension and join the remaining elements with a comma using join(). NaN is not None, so the filter uses pd.notna(). Note that set() does not keep the original order.

df['all_symptoms'] = df.apply(lambda row: ",".join([x for x in set(row) if pd.notna(x)]), axis=1)

This returns all the symptoms of each row separated by commas.
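If the left-to-right order of the symptoms matters, keep in mind that set() does not guarantee it. A minimal order-preserving sketch that uses `dict.fromkeys` instead of a set and skips missing values with `pd.notna`; the sample frame is an assumption modelled on the question.

```python
import numpy as np
import pandas as pd

# Small assumed sample: one full row, one row with a duplicate and a NaN.
df = pd.DataFrame({
    "symptom_1": ["muscle pain", "something awful"],
    "symptom_2": ["super headache", "something awful"],
    "symptom_3": ["diarrhea", np.nan],
})

# dict.fromkeys keeps the first occurrence of each value in insertion order.
df["all_symptoms"] = df.apply(
    lambda row: ",".join(dict.fromkeys(x for x in row if pd.notna(x))),
    axis=1,
)

print(df["all_symptoms"].tolist())
```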


0

Hopefully the code below helps:

import pandas as pd

data = {
        'symptom_1' : ["muscle pain", "super rash", "diarrhea", "something awful"], 
        'symptom_2' : ["super headache", "ulcera", "super diarrhea", "something awful"],
        'symptom_3' :["diarrhea", "super headache"],
        'symptom_4' : ["Sore throat"],
        'symptom_5' :["Fatigue"]
        }
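# Build each group of columns separately (the lists have different lengths).
df = pd.DataFrame(data, columns=['symptom_1', 'symptom_2'])
df1 = pd.DataFrame(data, columns=['symptom_3'])
df2 = pd.DataFrame(data, columns=['symptom_4', 'symptom_5'])

# Aligning on the index fills the shorter columns with NaN.
new = pd.concat([df, df1, df2], axis=1)

# Replace NaN with '' before concatenating; otherwise the result would be NaN.
new['symptom_6'] = (new['symptom_1'] + "," + new['symptom_2'] + ","
                    + new['symptom_3'].fillna('') + ","
                    + new['symptom_4'].fillna('') + ","
                    + new['symptom_5'].fillna(''))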

print(new)
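
The concatenation above keeps duplicates and leaves a stray comma for every empty slot. A sketch that folds the clean-up into the merge_columns_into_one signature from the question and actually honours drop_originals. The `sep` parameter, returning a copy instead of modifying the input, and the sample frame are assumptions added for illustration.

```python
import numpy as np
import pandas as pd


def merge_columns_into_one(frame, columns_to_combine, new_col_name,
                           sep=", ", drop_originals=False):
    """Join the unique, non-missing values of each row into one string."""
    out = frame.copy()  # leave the caller's frame untouched (assumed choice)
    out[new_col_name] = [
        # dict.fromkeys removes duplicates while keeping the original order
        sep.join(dict.fromkeys(v for v in row if pd.notna(v)))
        for row in out[columns_to_combine].itertuples(index=False, name=None)
    ]
    if drop_originals:
        out = out.drop(columns=columns_to_combine)
    return out


df = pd.DataFrame({
    "symptom_1": ["super rash", "something awful"],
    "symptom_2": ["ulcera", "something awful"],
    "symptom_3": ["super headache", np.nan],
})

result = merge_columns_into_one(
    df, ["symptom_1", "symptom_2", "symptom_3"], "all_symptoms",
    drop_originals=True,
)
print(result)
```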
