Pandas: Merge two string columns in Python, remove duplicated strings and remove unwanted string unless only unwanted string left

Question

I'm trying to merge two string columns, and I want to drop 'others' whenever the other column holds a non-'others' value: 'apple' + 'others' = 'apple', but 'others' + 'others' = 'others'. I managed the second condition, but how can I satisfy both conditions in the merge?

import pandas as pd

data = {'fruit1':["organge", "apple", "organge", "organge", "others"],
        'fruit2':["apple", "others", "organge", "watermelon", "others"]}
df = pd.DataFrame(data)

df["together"] = df["fruit1"] + ' ' + df["fruit2"]
df["together"] = df["together"].apply(lambda x: ' '.join(pd.unique(x.split())))

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others        apple others
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

Expected output:

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

If you look at row index 1, that would be 'apple' in this case. So when joining the two columns, if there's a value other than 'others', I wish to remove the 'others' but keep the value like 'apple'.
– codedancer, Commented Oct 6, 2021 at 8:52
What happens if you have an others in the first column? Like others, apple
– Dani Mesejo, Commented Oct 6, 2021 at 8:54
In this case, you can look at row index 4 where 'others' + 'others' = 'others'.
– codedancer, Commented Oct 6, 2021 at 8:55
But what about 'others' + 'apple'? Is the result 'apple' or 'others apple'?
– Dani Mesejo, Commented Oct 6, 2021 at 8:58

Dani Mesejo · Accepted Answer · 2021-10-06 10:01:26Z

5

You only want to remove one "others", so simply join the two columns and then call str.replace with n=1:

df["together"] = (df["fruit1"] + " " + df["fruit2"]).str.replace("others", "", n=1).str.strip()
print(df)

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge     organge organge
3  organge  watermelon  organge watermelon
4   others      others              others

The n parameter specifies the number of replacements to make. From the documentation:

n int, default -1 (all)
Number of replacements to make from start.
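To make the effect of n concrete, here is a small sketch on a made-up Series (the sample strings are my own, not from the question, and regex=False is added only so the pattern is treated literally). It compares n=1 with the default, and also covers the 'others' + 'apple' case asked about in the comments:

```python
import pandas as pd

# Hypothetical sample strings, already joined with a space
s = pd.Series(["apple others", "others others", "others apple", "organge apple"])

# n=1 removes only the first match in each string
first_only = s.str.replace("others", "", n=1, regex=False).str.strip()

# The default (n=-1) removes every match, so an all-"others" row ends up empty
all_matches = s.str.replace("others", "", regex=False).str.strip()

print(first_only.tolist())   # ['apple', 'others', 'apple', 'organge apple']
print(all_matches.tolist())  # ['apple', '', 'apple', 'organge apple']
```

With n=1 the row that only contains "others" keeps one copy, which is exactly the behaviour the question asks for with two columns.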

UPDATE

To also remove duplicates use the following regular expression:

df["together"] = df["together"].str.replace(r"\b(\w+)\s+\1\b", r"\1", n=1, regex=True).str.strip()
print(df)

Output

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

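How the regex works: \b(\w+) captures a whole word, \s+\1\b matches whitespace followed by that same word again, and the replacement \1 keeps a single copy of it.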
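If you want to check the pattern outside pandas, a minimal sketch with the standard re module could look like this (the sample strings are made up for illustration). Note that the pattern only collapses duplicates that sit directly next to each other:

```python
import re

# \b(\w+) captures one whole word; \s+\1\b matches whitespace plus that same word
pattern = re.compile(r"\b(\w+)\s+\1\b")

# Hypothetical inputs; "apple pineapple" must stay untouched
samples = ["organge organge", "organge apple", "others others", "apple pineapple"]

# count=1 mirrors n=1 in the pandas call
deduped = [pattern.sub(r"\1", text, count=1) for text in samples]

print(deduped)  # ['organge', 'organge apple', 'others', 'apple pineapple']
```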

edited Oct 6, 2021 at 10:01

answered Oct 6, 2021 at 9:05

Dani Mesejo


4 Comments

codedancer Over a year ago

Elegantly done! Thanks!

SeaBean Over a year ago

@codedancer This does not remove duplicates. Note the organge organge in result.

Dani Mesejo Over a year ago

@codedancer Could you clarify on this?

codedancer Over a year ago

I think @SeaBean's comment is correct. It doesn't remove the duplicate value. I tried to make it 3 steps again - doing mine first and yours later but it is not working as expected.

SeaBean · Accepted Answer · 2021-10-06 09:54:44Z

4

You can replace others with NaN, drop the NaN values with dropna() while joining, and then replace any resulting empty string with a single others:

import numpy as np

df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan)
                   .apply(lambda x: ' '.join(pd.unique(x.dropna())), axis=1)
                   .replace('', 'others')
                 )

Alternatively, use str.replace with n=1 as suggested by @Dani (caution: this won't work if there are 3 columns to aggregate, since it may leave 2 instances of others) and combine it with the OP's duplicate-removal logic, as follows:

df["together"] = (df["fruit1"] + " " + df["fruit2"]).str.replace("others", "", n=1).apply(lambda x: ' '.join(pd.unique(x.split())))

Result:

print(df)

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

edited Oct 6, 2021 at 9:54

answered Oct 6, 2021 at 9:08

SeaBean


6 Comments

codedancer Over a year ago

I just ran through the whole thing again and you're right. I checked yours and it works as expected. I was so focused on removing the others that I ignored the duplicate condition. Thanks for pointing out the problem!

SeaBean Over a year ago

@codedancer You are welcome! And take note that if you have 3 columns to aggregate, the n=1 approach may remove just one others and leave you with 2 others. Anyway, it's a very good feature to learn.

codedancer Over a year ago

I am wondering if there's a way we can prevent this when more columns are involved.

SeaBean Over a year ago

@codedancer My first solution replacing others by NaN works for more columns.

codedancer Over a year ago

Just tested it on a large scale, it works as you suggested!
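For anyone wondering what the NaN-based approach looks like with more columns, here is a rough sketch. The third column fruit3 and its values are made up for illustration and are not part of the question:

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; fruit3 does not exist in the original data
df3 = pd.DataFrame({
    "fruit1": ["organge", "others", "others"],
    "fruit2": ["others", "others", "apple"],
    "fruit3": ["organge", "others", "others"],
})

cols = ["fruit1", "fruit2", "fruit3"]
df3["together"] = (
    df3[cols]
    .replace("others", np.nan)                                    # hide the placeholder
    .apply(lambda row: " ".join(row.dropna().unique()), axis=1)   # join distinct remaining values
    .replace("", "others")                                        # rows that held only the placeholder
)

print(df3["together"].tolist())  # ['organge', 'others', 'apple']
```

Because every others is turned into NaN before joining, the number of columns does not matter, unlike the n=1 variant, which removes only one occurrence per row.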


Mahi · Accepted Answer · 2021-10-06 08:52:52Z

1

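def merge_columns(df, col1, col2, new_col, unwanted_string):
    '''Merge two string columns, dropping unwanted_string unless it is the only value left'''
    # Join with a space so the two values stay separate words
    merged = df[col1].astype(str) + ' ' + df[col2].astype(str)
    # n=1 removes a single occurrence, so 'others others' still yields 'others'
    # (duplicates such as 'organge organge' are not removed here)
    df[new_col] = merged.str.replace(unwanted_string, '', n=1, regex=False).str.strip()
    return df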

answered Oct 6, 2021 at 8:52

Mahi


vtasca · Accepted Answer · 2021-10-06 09:03:31Z

1

You can do it in a lambda function applied to the together column you already built in the question (copy-pasting it will work):

df.together = df.together.apply(lambda x: x if 'others' not in x else ('others' if all([y == '' for y in x.split('others')]) else x.replace('others', '').strip()))

Giving you:

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

answered Oct 6, 2021 at 9:03

vtasca


albert · Accepted Answer · 2021-10-06 10:22:25Z

You can set the resulting value based on an if-statement and the fruit_x values. To do so, I suggest using replace() to turn others into None, which makes conditional checking and re-replacing very easy. Call .apply() with axis=1 to perform a row-wise operation, which I implemented in a separate function concat_strings for the sake of readability. In addition, I would chain all required operations in a single statement so as not to change the original data set.

A very basic approach could look like this:

import pandas as pd


def concat_strings(row):
    fruit_1 = row['fruit1']
    fruit_2 = row['fruit2']

    if fruit_1 == fruit_2:
        return fruit_1
    elif fruit_1 and fruit_2:
        return fruit_1 + ' ' + fruit_2
    elif fruit_1:
        return fruit_1
    elif fruit_2:
        return fruit_2


# create dataframe
data = {
    'fruit1': ["organge", "apple", "organge", "organge", "others"],
    'fruit2': ["apple", "others", "organge", "watermelon", "others"]
}
df = pd.DataFrame(data)

# replace "others" with None to use as boolean later
# concat strings
# replace None values with "others" to get desired output
df["together"] = (
    df
    .replace({"others": None})
    .apply(concat_strings, axis=1)
    .replace({None: "others"})
)

# print final results
print(df)

Resulting output:

      fruit1      fruit2            together
  0  organge       apple       organge apple
  1    apple      others               apple
  2  organge     organge             organge
  3  organge  watermelon  organge watermelon
  4   others      others              others
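The same row-wise idea can also be sketched without the replace() round trip and for any number of values. The helper name combine and its placeholder parameter are my own and hypothetical. It uses dict.fromkeys for order-preserving de-duplication, which also avoids passing a plain list to pd.unique (as far as I know, newer pandas versions warn about that):

```python
import pandas as pd


def combine(values, placeholder="others"):
    """Join distinct values in order, dropping the placeholder unless nothing else remains."""
    # dict.fromkeys keeps first-seen order while removing duplicates
    distinct = list(dict.fromkeys(values))
    kept = [v for v in distinct if v != placeholder]
    return " ".join(kept) if kept else placeholder


df = pd.DataFrame({
    "fruit1": ["organge", "apple", "organge", "organge", "others"],
    "fruit2": ["apple", "others", "organge", "watermelon", "others"],
})

# zip walks both columns row by row; more columns can be added to the zip call
df["together"] = [combine(pair) for pair in zip(df["fruit1"], df["fruit2"])]
print(df)
```

This is only a sketch of one possible design: the logic lives in a plain Python function, so it is easy to unit-test on its own before applying it to a DataFrame.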
