How to combine DataFrame columns of strings into a single column?

Question

I have a DF with about 50 columns. 5 of them contain strings that I want to combine into a single column, separating the strings with commas but also keeping the spaces within each of the strings. Moreover, some values are missing (NaN). The last requirement would be to remove duplicates if they exist.

So I have something like this in my DF:

symptom_1	symptom_2	symptom_3	symptom_4	symptom 5
muscle pain	super headache	diarrhea	Sore throat	Fatigue
super rash	ulcera	super headache
diarrhea	super diarrhea
something awful	something awful

And I need something like this:

symptom_1	symptom_2	symptom_3	symptom_4	symptom 5	all_symptoms
muscle pain	super headache	diarrhea	Sore throat	Fatigue	muscle pain, super headache, diarrhea, Sore throat, Fatigue
super rash	ulcera	super headache			super rash, ulcera, super headache
diarrhea	super diarrhea				diarrhea, super diarrhea
something awful	something awful				something awful

I wrote the following function and while it merges all the columns it does not respect the spaces within the original strings, which is a must.

def merge_columns_into_one(DataFrame, columns_to_combine, new_col_name, drop_originals = False):
    DataFrame[new_col_name] = DataFrame[columns_to_combine].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return DataFrame

Thanks in advance for your help!

edit: when I'm writing this question the second markdown table appears just fine in the preview, but as soon as I post it the table loses its format. I hope you get the idea of what I'm trying to do. Otherwise I'd appreciate your feedback on how to fix the MD table.

Yeah, I've been trying; it looks just fine in the preview, but as soon as I post the question it loses the MD table format. Do you have any suggestions or workarounds for this? Perhaps I can post an image?
– Luis, Commented Apr 8, 2021 at 2:40

Anurag Dabas · Accepted Answer · 2021-08-21 16:27:33Z

2

Just use the fillna(), apply() and rstrip() methods:

df['all_symptoms'] = df.fillna('').apply(pd.unique, axis=1).apply(','.join).str.rstrip(',')

Now if you print df you will get your desired output:

symptom_1	symptom_2	symptom_3	symptom_4	symptom 5	all_symptoms
muscle pain	super headache	diarrhea	Sore throat	Fatigue	muscle pain, super headache, diarrhea, Sore throat, Fatigue
super rash	ulcera	super headache			super rash, ulcera, super headache
diarrhea	super diarrhea				diarrhea, super diarrhea
something awful	something awful				something awful
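
A slightly more defensive variant of the one-liner above, as a sketch only. It assumes the five columns are named symptom_1 through symptom_5 (the question's table shows "symptom 5"), selects just those columns out of the roughly 50, and drops NaN before deduplicating so that a gap in the middle of a row cannot leave an empty entry. The helper name join_unique and the sample frame are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; the column names are an assumption.
df = pd.DataFrame({
    "symptom_1": ["muscle pain", "super rash", "diarrhea", "something awful"],
    "symptom_2": ["super headache", "ulcera", "super diarrhea", "something awful"],
    "symptom_3": ["diarrhea", "super headache", np.nan, np.nan],
    "symptom_4": ["Sore throat", np.nan, np.nan, np.nan],
    "symptom_5": ["Fatigue", np.nan, np.nan, np.nan],
})

symptom_cols = ["symptom_1", "symptom_2", "symptom_3", "symptom_4", "symptom_5"]


def join_unique(row):
    # Drop missing values first, then deduplicate in first-seen order.
    return ", ".join(pd.unique(row.dropna()))


# Only the symptom columns take part; other columns in df are left alone.
df["all_symptoms"] = df[symptom_cols].apply(join_unique, axis=1)
print(df["all_symptoms"].tolist())
```

Because missing values are removed rather than replaced by empty strings, no trailing-comma cleanup is needed.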

edited Aug 21, 2021 at 16:27

answered Apr 8, 2021 at 2:59

Anurag Dabas


1 Comment

Luis Over a year ago

Thank you, it was a really clean and clear one-liner solution.

sammywemmy · Accepted Answer · 2021-04-08 03:29:31Z

2

You can use pandas str.cat, with some massaging:

(df
 .fillna("")
 .assign(all_symptoms = lambda df: df.iloc[:, 0]
                                    .str.cat(df.iloc[:, 1:], 
                                             sep=',')
                                    .str.strip(",")
                                    .str.split(",")
                                    .map(pd.unique)
                                    .str.join(","))
         )

         symptom_1        symptom_2       symptom_3    symptom_4 symptom 5                                       all_symptoms
0      muscle pain   super headache        diarrhea  Sore throat   Fatigue  muscle pain,super headache,diarrhea,Sore throa...
1       super rash           ulcera  super headache                                          super rash,ulcera,super headache
2         diarrhea   super diarrhea                                                                   diarrhea,super diarrhea
3  something awful  something awful                                                                           something awful

Alternatively, you could run the string operations within plain python, which is usually faster than pandas string methods (they are wrappers around python's string methods anyways):

df = df.fillna("")

_, strings = zip(*df.items())

strings = zip(*strings)

strings = map(pd.unique, strings)

strings = map(",".join, strings)

df['all_symptoms'] = [entry.strip(",") for entry in strings]
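
For reference, a self-contained sketch of the same plain-Python idea. It assumes the symptom columns are named symptom_1 to symptom_5 and swaps pd.unique for dict.fromkeys, which also keeps first-seen order and works directly on plain tuples. Missing values are skipped with an isinstance check instead of fillna(""), so no comma stripping is needed. This is an illustration under those assumptions, not the answer's exact code.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; the column names are an assumption.
df = pd.DataFrame({
    "symptom_1": ["muscle pain", "super rash", "diarrhea", "something awful"],
    "symptom_2": ["super headache", "ulcera", "super diarrhea", "something awful"],
    "symptom_3": ["diarrhea", "super headache", np.nan, np.nan],
    "symptom_4": ["Sore throat", np.nan, np.nan, np.nan],
    "symptom_5": ["Fatigue", np.nan, np.nan, np.nan],
})
cols = ["symptom_1", "symptom_2", "symptom_3", "symptom_4", "symptom_5"]

# itertuples(index=False, name=None) yields one plain tuple per row.
rows = df[cols].itertuples(index=False, name=None)

# dict.fromkeys deduplicates while preserving order; non-strings (NaN) are skipped.
df["all_symptoms"] = [
    ",".join(dict.fromkeys(value for value in row if isinstance(value, str)))
    for row in rows
]
print(df["all_symptoms"].tolist())
```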

edited Apr 8, 2021 at 3:29

answered Apr 8, 2021 at 3:19

sammywemmy


3 Comments

Luis Over a year ago

Any speed improvement is really welcome, so I tried it, but to no avail:

columns = ['symptom_1', 'symptom_2', 'symptom_3', 'symptom_4', 'symptom_5']
data[columns].fillna('')
_,strings = zip(*data_3[columns].items())
strings = zip(*strings)
strings = map(pd.unique, strings)
strings = map(",".join, strings)
## debug print
print(data[columns].info())
##
data['all_symptoms'] = [entry.strip(",") for entry in strings]

Error:

data['all_symptoms'] = [entry.strip(",") for entry in strings]
TypeError: sequence item 3: expected str instance, float found

Luis Over a year ago

I don't know why the formatting came out so badly in my previous comment. But I double-checked, and my columns only contain strings:

print(data[columns].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   symptom_1  4 non-null      object
 1   symptom_2  4 non-null      object
 2   symptom_3  2 non-null      object
 3   symptom_4  1 non-null      object
 4   symptom_5  1 non-null      object
dtypes: object(5)

sammywemmy Over a year ago

not sure why it is seeing a float; might require more investigation; glad though that you got your challenge resolved with the chosen answer
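
One plausible explanation for the TypeError in the comments above, offered as a guess rather than a confirmed diagnosis: fillna returns a new DataFrame, and the snippet calls data[columns].fillna('') without keeping the result, so the NaN values (which are floats) are still present when the strings are joined. The snippet also fills data but reads from data_3. The sketch below uses a small hypothetical frame to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame standing in for the commenter's `data`.
data = pd.DataFrame({
    "symptom_1": ["super rash", "diarrhea"],
    "symptom_2": ["ulcera", np.nan],
})
columns = ["symptom_1", "symptom_2"]

# fillna returns a new DataFrame; without assignment `data` keeps its NaN values.
data[columns].fillna("")
has_nan_before = bool(data[columns].isna().any().any())

# Joining a row that still holds NaN (a float) raises TypeError.
try:
    ",".join(data.loc[1, columns])
    join_failed = False
except TypeError:
    join_failed = True

# Keeping the filled result avoids the error.
filled = data[columns].fillna("")
joined = ",".join(filled.loc[1, columns]).strip(",")
print(has_nan_before, join_failed, joined)
```

An object dtype in info() does not rule this out: an object column can hold both strings and NaN, which is what the non-null counts of 2 and 1 suggest.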

Hadi Rohani · Accepted Answer · 2021-04-08 02:52:27Z

0

Here is an example that should give you an idea of how to approach it:

import pandas as pd

df1 = pd.DataFrame(
     {
       "A": ["A0", "A1", "A2", "A3"],
       "B": ["B0", "B1", "B2", "B3"],
       "C": ["C0", "C1", "C2", "C3"],
       "D": ["D0", "D1", "D2", "D3"],
     },
index=[0, 1, 2, 3],)

df1["merged"] = df1["A"]+"," + df1["B"]+","+df1["C"]+","+df1["D"]
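
One caveat with + when values are missing, as in the question: concatenating a string with NaN yields NaN for the whole row. The sketch below uses a reduced, hypothetical two-column version of df1 to show this, alongside Series.str.cat with na_rep as one possible way around it.

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical version of df1 with one missing value.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", np.nan]})

# Plain + propagates missing values: the second row becomes NaN.
plus_merged = df1["A"] + "," + df1["B"]

# str.cat with na_rep substitutes an empty string for NaN;
# the trailing comma this leaves behind is then stripped.
cat_merged = df1["A"].str.cat(df1["B"], sep=",", na_rep="").str.rstrip(",")

print(plus_merged.tolist())
print(cat_merged.tolist())
```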

edited Apr 8, 2021 at 2:52

answered Apr 8, 2021 at 2:45

Hadi Rohani


Ransaka Ravihara · Accepted Answer · 2021-04-08 04:01:56Z

0

First, apply a lambda function to each row, then use set() to collect the symptoms without repetitions, and finally filter out missing values with a list comprehension and join the remaining elements with a comma using join().

df['all_symptoms'] = df.apply(lambda row: ",".join([x for x in set(row) if pd.notna(x)]), axis=1)

This returns all the symptoms separated by commas. Note that set() does not preserve the original column order.

edited Apr 8, 2021 at 4:01

answered Apr 8, 2021 at 3:55

Ransaka Ravihara


Deepak · Accepted Answer · 2021-04-08 06:36:29Z

I hope the code below helps:

import pandas as pd

data = {
        'symptom_1' : ["muscle pain", "super rash", "diarrhea", "something awful"], 
        'symptom_2' : ["super headache", "ulcera", "super diarrhea", "something awful"],
        'symptom_3' :["diarrhea", "super headache"],
        'symptom_4' : ["Sore throat"],
        'symptom_5' :["Fatigue"]
        }
df = pd.DataFrame(data, columns=['symptom_1', 'symptom_2'])
df1 = pd.DataFrame(data, columns=['symptom_3'])
df2 = pd.DataFrame(data, columns=['symptom_4', 'symptom_5'])

new = pd.concat([df, df1, df2], axis=1) 

new['symptom_6'] = new['symptom_1']+","+new['symptom_2']+","+new['symptom_3'].fillna('')+","+new['symptom_4'].fillna('')+","+new['symptom_5'].fillna('')

print(new)
