I'm trying to merge two string columns, and I want to drop 'others' whenever the counter value (the value in the other column) is not 'others': 'apple' + 'others' should give 'apple', but 'others' + 'others' should stay 'others'. I managed the second condition, but how can I satisfy both conditions in the merge?

import pandas as pd

data = {'fruit1':["organge", "apple", "organge", "organge", "others"],
        'fruit2':["apple", "others", "organge", "watermelon", "others"]}
df = pd.DataFrame(data)

df["together"] = df["fruit1"] + ' ' + df["fruit2"]
df["together"] = df["together"].apply(lambda x: ' '.join(pd.unique(x.split())))

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others        apple others
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

Expected output:

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others
6
  • What is the counter value? Commented Oct 6, 2021 at 8:50
  • If you look at row index 1, that would be 'apple' in this case. So when joining the two columns, if there's another value other than 'others', I wish to remove the 'others' but keep the value like 'apple'. Commented Oct 6, 2021 at 8:52
  • 1
    What happens if you have an others in the first column? Like others, apple Commented Oct 6, 2021 at 8:54
  • In this case, you can look at row index 4 where 'others' + 'others' = 'others'. Commented Oct 6, 2021 at 8:55
  • But what about 'others' + 'apple'? Is the result 'apple' or 'others apple'? Commented Oct 6, 2021 at 8:58

5 Answers

5

You want to remove only one "others", so simply join the two columns and then call str.replace with n=1:

df["together"] = (df["fruit1"] + " " + df["fruit2"]).str.replace("others", "", n=1).str.strip()
print(df)

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge     organge organge
3  organge  watermelon  organge watermelon
4   others      others              others

The n parameter specifies the number of replacements to make. From the documentation:

n int, default -1 (all)
Number of replacements to make from start.
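
As a small illustration of what n=1 changes, the sketch below compares the default (replace every occurrence) with a single replacement on two joined strings taken from rows 1 and 4 of the question. The variable names are made up for this example.

```python
import pandas as pd

# Joined strings as they look before any cleanup (rows 1 and 4)
joined = pd.Series(["apple others", "others others"])

# Default n=-1 removes every occurrence, so row 4 ends up empty
replace_all = joined.str.replace("others", "", regex=False).str.strip()

# n=1 removes only the first occurrence, so one "others" survives in row 4
replace_once = joined.str.replace("others", "", n=1, regex=False).str.strip()

print(replace_all.tolist())   # ['apple', '']
print(replace_once.tolist())  # ['apple', 'others']
```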

UPDATE

To also remove duplicates, apply the following regular expression to the result:

df["together"] = df["together"].str.replace(r"\b(\w+)\s+\1\b", r"\1", n=1, regex=True).str.strip()
print(df)

Output

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

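The pattern \b(\w+)\s+\1\b matches a whole word followed by whitespace and the same word again (the backreference \1); the replacement \1 keeps a single copy of it.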
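
The pattern can also be tried in isolation with Python's re module. This is only a sketch on hand-picked sample strings; `pattern`, `samples` and `collapsed` are illustrative names, and count=1 plays the role of n=1 in the pandas call.

```python
import re

# \b(\w+)  : capture one whole word
# \s+\1\b  : whitespace followed by the same word again (backreference)
pattern = re.compile(r"\b(\w+)\s+\1\b")

samples = ["organge organge", "organge apple", "others others"]

# Replace the first repeated pair in each string with a single copy
collapsed = [pattern.sub(r"\1", s, count=1) for s in samples]
print(collapsed)  # ['organge', 'organge apple', 'others']
```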

4 Comments

Elegantly done! Thanks!
@codedancer This does not remove duplicates. Note the organge organge in result.
@codedancer Could you clarify on this?
I think @SeaBean's comment is correct. It doesn't remove the duplicate value. I tried to make it 3 steps again - doing mine first and yours later but it is not working as expected.
4

You can replace others with NaN, drop the NaN values with dropna() while joining each row, and then turn any resulting empty string back into a single others:

import numpy as np

df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan)
                   .apply(lambda x: ' '.join(pd.unique(x.dropna())), axis=1)
                   .replace('', 'others')
                 )

Or leverage str.replace with n=1 as suggested by @Dani (caution: this won't work if there are 3 columns to aggregate, since it may leave 2 instances of others) and combine it with the OP's duplicate-removal logic, as follows:

df["together"] = (df["fruit1"] + " " + df["fruit2"]).str.replace("others", "", n=1).apply(lambda x: ' '.join(pd.unique(x.split())))

Result:

print(df)

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others
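
To show how the NaN-based variant carries over to more than two columns, here is a minimal sketch. The third column fruit3, its values, and the names df3 and join_row are assumptions made up for this example; they are not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; fruit3 does not exist in the question
df3 = pd.DataFrame({
    "fruit1": ["apple", "others", "others"],
    "fruit2": ["others", "others", "apple"],
    "fruit3": ["others", "others", "kiwi"],
})

def join_row(row):
    # Drop the NaN placeholders, then keep each remaining value once, in order
    return " ".join(pd.unique(row.dropna()))

df3["together"] = (
    df3[["fruit1", "fruit2", "fruit3"]]
    .replace("others", np.nan)   # hide every "others"
    .apply(join_row, axis=1)     # join what is left in each row
    .replace("", "others")       # rows that held only "others"
)
print(df3["together"].tolist())  # ['apple', 'others', 'apple kiwi']
```

Because every others is hidden before the join, the number of columns does not matter, unlike the n=1 variant, which removes a single occurrence only.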

6 Comments

I just ran through it again and you're right. I checked yours and it works as expected. I was so focused on removing the others that I ignored the duplicate condition. Thanks for pointing out the problem!
@codedancer You are welcome! Also note that if you have 3 columns to aggregate, the n=1 approach removes only one others and may leave you with 2. Anyway, it's a very good feature to learn.
I am wondering if there's a way we can prevent this when more columns are involved.
@codedancer My first solution replacing others by NaN works for more columns.
Just tested it on a large scale, it works as you suggested!
1
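import pandas as pd

def merge_columns(df, col1, col2, new_col, unwanted_string):
    '''Merge two string columns, dropping unwanted_string unless it is the only value in the row'''
    parts = df[[col1, col2]].astype(str)
    # Keep each value once per row, skipping unwanted_string
    merged = parts.apply(lambda row: ' '.join(pd.unique(row[row != unwanted_string])), axis=1)
    # Rows that contained only unwanted_string fall back to it
    df[new_col] = merged.replace('', unwanted_string)
    return df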

1

You can do it with a lambda function applied to the together column built in the question (copy-pasting it will work):

df.together = df.together.apply(lambda x: x if 'others' not in x else ('others' if all([y == '' for y in x.split('others')]) else x.replace('others', '').strip()))

Giving you:

    fruit1      fruit2            together
0  organge       apple       organge apple
1    apple      others               apple
2  organge     organge             organge
3  organge  watermelon  organge watermelon
4   others      others              others

1

You can set the resulting value with an if-statement on the two fruit values. To do so, I suggest using replace() to swap others for None, which makes the conditional checks and the later re-replacement very easy. Call .apply() with axis=1 to perform a row-wise operation, which I implemented in a separate function, concat_strings, for the sake of readability. In addition, I would chain all required operations in a single statement so that the original data set is not changed.

A very basic approach could look like this:

import pandas as pd


def concat_strings(row):
    fruit_1 = row['fruit1']
    fruit_2 = row['fruit2']

    if fruit_1 == fruit_2:
        return fruit_1
    elif fruit_1 and fruit_2:
        return fruit_1 + ' ' + fruit_2
    elif fruit_1:
        return fruit_1
    elif fruit_2:
        return fruit_2


# create dataframe
data = {
    'fruit1': ["organge", "apple", "organge", "organge", "others"],
    'fruit2': ["apple", "others", "organge", "watermelon", "others"]
}
df = pd.DataFrame(data)

# replace "others" with None to use as boolean later
# concat strings
# replace None values with "others" to get desired output
df["together"] = (
    df
    .replace({"others": None})
    .apply(concat_strings, axis=1)
    .replace({None: "others"})
)

# print final results
print(df)

Resulting output:

      fruit1      fruit2            together
  0  organge       apple       organge apple
  1    apple      others               apple
  2  organge     organge             organge
  3  organge  watermelon  organge watermelon
  4   others      others              others
