1

I have a DataFrame as below:

index text_column

   0   ,(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
   1   ,Info_concern,Info_concern
   2   ,color_Concern,color_Concern,no_category
   3   ,reg_Concern,reg_Concern

I am trying to remove duplicates including the source value completely within each row.

I tried this:

df['result'] = [set(x) for x in df['text_column']]

This gives me a list of values without duplicates but with source value, I need the source value to be removed as well.

Desired output:

     result
    
  0  (concern_code),(concern_color)
  1  
  2  no_category
  3  

Any suggestions or advice ?

1 Answer 1

1

Version 1: Removing duplicates across all rows:

You can use .drop_duplicates() with parameter keep=False after splitting and expanding the substrings by .str.split() and .explode().

Then, regroup the entries into their original rows by .groupby() on the row index (level 0). Finally, aggregate and join back the substrings of the original same row with .agg() and ','.join

df['result'] = (df['text_column'].str.split(',')
                                 .explode()
                                 .drop_duplicates(keep=False)
                                 .groupby(level=0).agg(','.join)
               )

.drop_duplicates() with parameter keep=False ensures to remove duplicates including the source value.

Alternatively, you can also do it with .stack() in place of .explode(), as follows:

df['result'] = (df['text_column'].str.split(',', expand=True)
                                 .stack()
                                 .drop_duplicates(keep=False)
                                 .groupby(level=0).agg(','.join)
               )

Data Input:

(Added extra test cases from the sample data in question:)

                                                      text_column
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1                                       Info_concern,Info_concern
2                         color_Concern,color_Concern,no_category
3                                         reg_Concern,reg_Concern
4                                                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL

Result:

print(df)

                                                      text_column                          result
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)  (concern_code),(concern_color)
1                                       Info_concern,Info_concern                             NaN
2                         color_Concern,color_Concern,no_category                     no_category
3                                         reg_Concern,reg_Concern                             NaN
4                                                    ABCDEFGHIJKL                             NaN
5                                                    ABCDEFGHIJKL                             NaN

Note the last 2 rows with same strings are removed as duplicates even when they are in different rows.

Version 2: Removing duplicates within the same row only:

If the scope of removing duplicates is limited to only within the same row rather than across all rows, we can achieve this by the following code variation:

df['result'] = (df['text_column'].str.split(',', expand=True)
                                 .stack()
                                 .groupby(level=0)
                                 .agg(lambda x: ','.join(x.drop_duplicates(keep=False)))
               )

Data Input:

(Added extra test cases from the sample data in question:)

                                                      text_column
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1                                       Info_concern,Info_concern
2                         color_Concern,color_Concern,no_category
3                                         reg_Concern,reg_Concern
4                                                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL

Output:

print(df)

                                                      text_column                          result
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)  (concern_code),(concern_color)
1                                       Info_concern,Info_concern                                
2                         color_Concern,color_Concern,no_category                     no_category
3                                         reg_Concern,reg_Concern                                
4                                                    ABCDEFGHIJKL                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL                    ABCDEFGHIJKL

Note the last 2 rows with same strings are kept since they are in different rows.

Sign up to request clarification or add additional context in comments.

13 Comments

Is there a way to implement this without explode()?
@code_learner Any concern with explode() ? This is the natural way to work with split strings one by one. Without using it, we would need much more complicated way to do it. May not actually save much with the complicated algorithm.
@code_learner See my edit above. Revised without using .explode().
I think stack() works for me the way it is expected to, really not sure why explode() wont work though. Appreciate your insights on this.
yes dropping duplicates was to be done to the same line only. With the previous solution, removing duplicates across the lines was actually giving me Nan's mostly. The updated solution works perfect. Thanks for clarifying the difference.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.