
Dataframe (linked as a screenshot; I am new here and need 10 reputation to embed pictures)

Expected result

The dataframe is imported from a csv file. 'types' and 'themes' are properties of each item. 'Tags' is a long string column that contains each item's tags in random order, separated by ', '. What I need to do is check whether the correct theme tag (col_{theme}) is present in the 'Tags' column and, if it is not, add it to the 'Tags' column.

For example:

item 8: there is a 'col_t3' in the 'Tags' column and its theme is 't3', so this is correct and we move on.

item 1: there is a 'col_t1' in the 'Tags' column, but its actual theme is 't2', so I need to replace 'col_t1' with 'col_t2' and keep the other tags in the same column unchanged.

item 2 and item 5: there is no 'col_{theme}' tag in the 'Tags' column, so I add 'col_t1' and 'col_t5' to their 'Tags' column respectively.

Please help !!

  • Could you help me display the screenshot image? The post only shows a link. I am new here, thank you! Commented Feb 22, 2018 at 2:49
  • CTRL + G then CTRL + V... or use the tools above the field. Commented Feb 22, 2018 at 2:52
  • I did, and I know why now. It says that because I am new here I need 10 reputation points to embed pictures. What a pity. Commented Feb 22, 2018 at 2:58
  • Good to know. If someone gets rid of the rest of your negative points on this question and you get a +1, you'll be there. I can only do one vote. Commented Feb 22, 2018 at 2:59
  • @Ray, please do not use images because they make it harder for other people to reproduce your data. Please paste raw data instead if possible. Commented Feb 22, 2018 at 3:09

1 Answer


This emulates the input you are showing in your screenshot:

import pandas as pd
import numpy as np

# Columns are listed in the order shown in the output below
df = pd.DataFrame({"tags": ["col_t1, col_red, large", np.nan, "col_t2, col_black, small",
                            "col_t4, large, col_yellow", "col_gold, col_fancy,", "col_t1, thick, col_k",
                            np.nan, "col_t3, fancy, red"],
                   "theme": ["t2", "t1", "t2", "t3", "t2", "t1", np.nan, "t3"],
                   "type": ["a", "c", "d", "a", "b", "a", "a", "c"]})

# Use a 1-based index so row labels match the item numbers in the question
df.set_index(np.arange(1, len(df)+1), inplace=True)
print(df)

Output:

                      tags theme type
1     col_t1, col_red, large    t2    a
2                        NaN    t1    c
3   col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow    t3    a
5       col_gold, col_fancy,    t2    b
6       col_t1, thick, col_k    t1    a
7                        NaN   NaN    a
8         col_t3, fancy, red    t3    c

Code that produces the desired output:

prefix = "col_"

# Iterate over rows with non-empty theme
for row in df[df["theme"].notnull()].itertuples():

    if pd.isnull(row.tags):
        # Replace NaN in tags column with a single tag from theme column 
        df.loc[row.Index, "tags"] = prefix + row.theme
    else:
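        # Extract existing tags that start with the prefix; strip whitespace
        # first so that tags after the first one also compare correctly
        inferred_tags = [t.strip()[len(prefix):] for t in row.tags.split(",")
                         if t.strip().startswith(prefix)]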

        if row.theme not in inferred_tags:
            df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme     
print(df)

Output:

                                tags theme type
1     col_t1, col_red, large, col_t2    t2    a
2                             col_t1    t1    c
3           col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow, col_t3    t3    a
5        col_gold, col_fancy, col_t2    t2    b
6               col_t1, thick, col_k    t1    a
7                                NaN   NaN    a
8                 col_t3, fancy, red    t3    c

Hopefully this is what you are looking for. itertuples() is generally faster than iterrows() for iterating over all rows. Also, keep in mind that I used numpy's np.nan only to emulate the NaNs in your input; if your data comes from a csv file, pandas produces those NaNs itself for empty fields, so you won't need numpy.
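To illustrate the point about csv input, here is a minimal sketch. The csv text and the `item` index column are made up for illustration (the real file is not shown in the question); it only demonstrates that empty fields come back as NaN from `pd.read_csv`, so `pd.isnull()` works without importing numpy.

```python
import io

import pandas as pd

# Hypothetical csv content standing in for the real file (assumption)
csv_text = """item,type,tags,theme
1,a,"col_t1, col_red, large",t2
2,c,,t1
7,a,,
"""

df_csv = pd.read_csv(io.StringIO(csv_text), index_col="item")

# Empty fields are parsed as NaN, so the null checks used above still apply
print(df_csv["tags"].isnull().tolist())
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.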

--- UPDATE ---

As explained in the comments, the code should replace tags that match themes. Here is the updated solution:

prefix = "col_"

# Find all unique themes (notnull() excludes nan from the list)
themes = df[df["theme"].notnull()]["theme"].unique()

# Add prefix to all themes for comparison with tags; store as a set
prefixed_themes = {prefix + t for t in themes}

# Iterate over rows with non-empty theme
for row in df[df["theme"].notnull()].itertuples():

    if pd.isnull(row.tags):
        # Replace NaN in tags column with a single tag from theme column 
        df.loc[row.Index, "tags"] = prefix + row.theme
    else:
        # Split into individual tags (prefix kept); strip whitespace and skip empty entries
        inferred_tags = [t.strip() for t in row.tags.split(",") if t.strip()]

        # Use sets to check if there is any intersection between tags and themes
        if len(set(inferred_tags).intersection(prefixed_themes)) > 0:

            # Iterate over inferred_tags to find and replace matches with themes 
            for idx, t in enumerate(inferred_tags):
                if t in prefixed_themes:
                    inferred_tags[idx] = prefix + row.theme

            df.loc[row.Index, "tags"] = ", ".join(inferred_tags) 
        else:
            # In this case, add theme to tags (no replacement)
            df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme 

print(df)

Output:

                                tags theme type
1             col_t2, col_red, large    t2    a
2                             col_t1    t1    c
3           col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow, col_t3    t3    a
5        col_gold, col_fancy, col_t2    t2    b
6               col_t1, thick, col_k    t1    a
7                                NaN   NaN    a
8                 col_t3, fancy, red    t3    c

Notice that the code checks tags against all values present in the theme column (with the prefix added). If a value (like t4) does not appear in the theme column, it is not considered a valid theme tag, so col_t4 in item 4 is not replaced during processing and col_t3 is appended instead. If you need every col_t* tag to be replaced, that matching rule has to be stated explicitly. Hopefully this is a useful solution and you can take it from here.
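If every col_t* tag should be replaced, one possible variant is sketched below. It assumes that theme tags always have the form col_t followed by digits, which is what the asker states in the comments. The helper `fix_tags` and the pattern `THEME_TAG` are hypothetical names introduced for this sketch, not part of the solution above.

```python
import re

import numpy as np
import pandas as pd

PREFIX = "col_"
# Assumption: a theme tag is always "col_t" followed by one or more digits
THEME_TAG = re.compile(r"col_t\d+")


def fix_tags(tags, theme):
    """Return the tag string with exactly one theme tag, matching `theme`."""
    if pd.isnull(theme):
        return tags  # no theme to enforce; leave the tags untouched
    wanted = PREFIX + theme
    if pd.isnull(tags):
        return wanted
    out = []
    for tag in (t.strip() for t in tags.split(",")):
        if not tag:
            continue  # skip empty entries left by trailing commas
        if THEME_TAG.fullmatch(tag):
            if wanted in out:
                continue  # keep only one theme tag per row
            tag = wanted  # replace any theme-like tag with the correct one
        out.append(tag)
    if wanted not in out:
        out.append(wanted)  # no theme tag was present, so append it
    return ", ".join(out)


# A few rows taken from the example input above
df = pd.DataFrame(
    {"tags": ["col_t1, col_red, large", np.nan,
              "col_t4, large, col_yellow", "col_gold, col_fancy,"],
     "theme": ["t2", "t1", "t3", "t2"]},
    index=[1, 2, 4, 5],
)
df["tags"] = [fix_tags(tags, theme) for tags, theme in zip(df["tags"], df["theme"])]
print(df)
```

Unlike the loop above, this variant rewrites col_t4 in item 4 to col_t3 instead of appending, because it matches on the tag pattern rather than on the values that happen to appear in the theme column.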


6 Comments

So good, but one flaw. Your solution is pretty much what I was looking for. The only thing that isn't right is that you didn't replace the incorrect tags. For example, item 1 has 'col_t1' and its actual theme is 't2', so when you add 'col_t2' to the tags column for item 1, 'col_t1' should be replaced or deleted. But I think this needs only a small change in your code. (to be continued...)
What I did was use a regular expression (the original data string is more complicated), but I like your solution much better, and I will try to use your method next time I edit the data. Thank you. By the way, if you understand what I said, could you update your solution? Much appreciated.
Ok, I see. Does the order of tags matter (can they be rearranged, e.g., to make col_<theme> tags go first or last if present)? Also, is there at most one col_<theme> tag (where <theme> comes from the theme column) in tags per row, or can there be more? Finally, can it be assumed that the tag is always col_tX (where t is always present and X is a number), or can the themes have a different format?
Check the updated solution above. There, the order of tags is preserved. Tags are replaced, except for the ones that are not present in the theme column, as described in the note at the end.
The order doesn't matter. There is always a col_tX in Tags. You did a great job!