python dataframe replace partial strings in a column based on other column's condition

Question

The DataFrame is in the linked screenshot (I am new here and need 10 reputation to embed pictures).

The DataFrame is imported from a CSV file. 'types' and 'themes' are each item's properties. 'Tags' is a long string column that contains each item's tags in random order, separated by ', '. Basically, what I need to do is check whether the correct theme tag (col_{theme}) is in the 'Tags' column and, if it is not, add it to the 'Tags' column.

For example:

item 8: there is a 'col_t3' in the 'Tags' column and its theme is 't3', so this is correct and we pass.

item 1: there is a 'col_t1' in the 'Tags' column, but its actual theme is 't2', so I need to replace 'col_t1' with 'col_t2' and keep the other tags in the same column unchanged.

item 2 and item 5: there is no 'col_{theme}' tag in the 'Tags' column, so I add 'col_t1' and 'col_t5' to their 'Tags' column respectively.

Please help !!

Could you help me display the screenshot image? The post only shows a link. I am new here, thank you!
– Ray, Commented Feb 22, 2018 at 2:49
I did, and I know why now. It says that because I am new here I need 10 reputation points to embed pictures. What a pity.
– Ray, Commented Feb 22, 2018 at 2:58
Good to know. If someone gets rid of the rest of your negative points on this question and you get a +1, you’ll be there. I can only do one vote.
– Brien Foss, Commented Feb 22, 2018 at 2:59
@Ray, please do not use images, because they make it harder for other people to reproduce your data. Please paste the raw data instead if possible.
– Allen Qin, Commented Feb 22, 2018 at 3:09

Dmitry Duplyakin · Accepted Answer · 2018-02-23 17:07:31Z

1

This emulates the input you are showing in your screenshot:

import pandas as pd
import numpy as np

df = pd.DataFrame({"type": ["a", "c", "d", "a", "b", "a", "a", "c"],
                   "tags": ["col_t1, col_red, large", np.nan, "col_t2, col_black, small",
                            "col_t4, large, col_yellow", "col_gold, col_fancy,", "col_t1, thick, col_k",
                            np.nan, "col_t3, fancy, red"],
                   "theme": ["t2", "t1", "t2", "t3", "t2", "t1", np.nan, "t3"]},
                  # Fix the column order so it matches the printed output
                  columns=["tags", "theme", "type"])

# Use a 1-based index so row labels match the item numbers in the question
df.index = np.arange(1, len(df) + 1)
print(df)

Output:

                      tags theme type
1     col_t1, col_red, large    t2    a
2                        NaN    t1    c
3   col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow    t3    a
5       col_gold, col_fancy,    t2    b
6       col_t1, thick, col_k    t1    a
7                        NaN   NaN    a
8         col_t3, fancy, red    t3    c

Code that produces the desired output:

prefix = "col_"

# Iterate over rows with non-empty theme
for row in df[df["theme"].notnull()].itertuples():

    if pd.isnull(row.tags):
        # Replace NaN in tags column with a single tag from theme column
        df.loc[row.Index, "tags"] = prefix + row.theme
    else:
        # Extract existing tags with prefix; strip() removes the space left after each comma
        inferred_tags = [t.strip().replace(prefix, "") for t in row.tags.split(",") if prefix in t]

        if row.theme not in inferred_tags:
            # Theme tag is missing: append it (dropping any trailing comma or space first)
            df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme

print(df)

Output:

                                tags theme type
1     col_t1, col_red, large, col_t2    t2    a
2                             col_t1    t1    c
3           col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow, col_t3    t3    a
5        col_gold, col_fancy, col_t2    t2    b
6               col_t1, thick, col_k    t1    a
7                                NaN   NaN    a
8                 col_t3, fancy, red    t3    c

Hopefully this is what you are looking for. itertuples() is generally faster than iterrows() for iterating over all rows, since it does not build a Series for each row. Also, keep in mind that I used numpy, specifically np.nan, only to emulate the NaNs in your input; if your data comes from a CSV file, pandas fills empty fields with NaN on import and you won't need numpy.
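To illustrate the CSV point, here is a minimal sketch. The CSV text is made up for illustration (it is not the asker's file), and df_csv is just a name chosen for this example. It shows that empty fields read by read_csv come back as NaN, so the pd.isnull() / notnull() checks work without importing numpy:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the asker's file; the empty
# fields play the role of the missing tags/themes in the screenshot.
csv_text = '''type,tags,theme
a,"col_t1, col_red, large",t2
c,,t1
a,,
'''

df_csv = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, so the null checks need no numpy import.
print(df_csv["tags"].isnull().tolist())   # [False, True, True]
print(df_csv["theme"].isnull().tolist())  # [False, False, True]

# itertuples() yields one namedtuple per row; row.Index is the row label.
for row in df_csv[df_csv["theme"].notnull()].itertuples():
    print(row.Index, row.theme)
```

With a real file you would pass its path to read_csv instead of the StringIO object.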

--- UPDATE ---

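As explained in the comments, the code should replace existing theme tags that do not match the row's theme. Here is the updated solution. Run it on the original df (re-create it first), not on the output of the first version: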

prefix = "col_"

# Find all unique themes (dropna() excludes NaN from the list)
themes = df["theme"].dropna().unique()

# Add prefix to all themes for comparison with tags; store as a set
prefixed_themes = {prefix + t for t in themes}

# Iterate over rows with non-empty theme
for row in df[df["theme"].notnull()].itertuples():

    if pd.isnull(row.tags):
        # Replace NaN in tags column with a single tag from theme column
        df.loc[row.Index, "tags"] = prefix + row.theme
    else:
        # Split into individual tags (prefix kept); strip spaces and drop empty pieces
        inferred_tags = [t.strip() for t in row.tags.split(",") if t.strip()]

        # Use sets to check if there is any intersection between tags and themes
        if set(inferred_tags) & prefixed_themes:

            # Replace every tag that matches a known theme with this row's theme tag
            inferred_tags = [prefix + row.theme if t in prefixed_themes else t
                             for t in inferred_tags]

            df.loc[row.Index, "tags"] = ", ".join(inferred_tags)
        else:
            # In this case, add theme to tags (no replacement)
            df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme

print(df)

Output:

                                tags theme type
1             col_t2, col_red, large    t2    a
2                             col_t1    t1    c
3           col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow, col_t3    t3    a
5        col_gold, col_fancy, col_t2    t2    b
6               col_t1, thick, col_k    t1    a
7                                NaN   NaN    a
8                 col_t3, fancy, red    t3    c

Notice that the code checks tags against all values present in the theme column (with the prefix added). If a value (like t4) does not appear in the theme column, it is not considered a legal theme tag, so col_t4 in item 4 is not replaced during processing. If you need every col_t* tag to be replaced, the matching rule has to say so explicitly. Hopefully this is a useful solution and you can take it from here.
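If every col_t* tag should be replaced, one possible way to spell that out is a regular expression. The sketch below assumes a theme tag is always "col_t" followed by digits (as the asker's comment suggests). fix_tags, THEME_TAG and demo are hypothetical names introduced only for this example, and it is a variation on the loop above rather than a drop-in replacement:

```python
import re

import numpy as np
import pandas as pd

# Assumption: a theme tag is "col_t" followed by digits, e.g. col_t1, col_t42.
THEME_TAG = re.compile(r"col_t\d+")


def fix_tags(tags, theme, prefix="col_"):
    """Return the tag string with its theme tag corrected (hypothetical helper)."""
    if pd.isnull(theme):
        return tags  # no theme to check against; leave the tags untouched
    wanted = prefix + theme
    if pd.isnull(tags):
        return wanted
    # Split on commas, trim whitespace, drop empty pieces left by trailing commas
    parts = [t.strip() for t in tags.split(",") if t.strip()]
    # Swap every col_tN tag for the wanted one, keeping its position
    parts = [wanted if THEME_TAG.fullmatch(t) else t for t in parts]
    if wanted not in parts:
        parts.append(wanted)
    # Drop duplicates while keeping first-occurrence order
    return ", ".join(dict.fromkeys(parts))


# Small made-up frame reusing a few rows from the example above
demo = pd.DataFrame({
    "tags": ["col_t1, col_red, large", np.nan,
             "col_t4, large, col_yellow", "col_gold, col_fancy,"],
    "theme": ["t2", "t1", "t3", "t2"],
})
demo["tags"] = [fix_tags(tags, theme)
                for tags, theme in zip(demo["tags"], demo["theme"])]
print(demo["tags"].tolist())
```

Here col_t4 is replaced by col_t3 even though t4 never appears in the theme column, which is the difference from the set-based check above. Tags such as col_red are left alone because they do not match the pattern.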

edited Feb 23, 2018 at 17:07

answered Feb 22, 2018 at 15:22


6 Comments

Ray Over a year ago

So good, but one flaw. Your solution is pretty much what I was looking for. The only thing that wasn't right is that you didn't replace the incorrect tags. For example, item 1 has 'col_t1' and its actual theme is 't2', so when you add 'col_t2' to the tags column for item 1, 'col_t1' should be replaced or deleted. But I think it only needs a small change in your code. (to be continued...)

Ray Over a year ago

What I did was use a regular expression (the original data string is more complicated). But I like your solution much better; I will try your method the next time I edit the data. Thank you. By the way, if you understand what I said, could you update your solution? Appreciated.

Dmitry Duplyakin Over a year ago

Ok, I see. Does the order of tags matter (can they be rearranged, e.g., to make col_<theme> tags go first or last if present)? Also, is there at most one col_<theme> tag (where <theme> comes from the theme column) per row, or can there be more? Finally, can it be assumed that the tag is always col_tX (where t is always present and X is a number), or can the themes have a different format?

Dmitry Duplyakin Over a year ago

Check the updated solution above. There, the order of tags is preserved. Tags are replaced, except for the ones that are not present in the theme column, as described in the comment at the end.

Ray Over a year ago

The order doesn't matter. There is always a col_tX in Tags. You did a great job!
