How to remove the duplicate string completely from the list using pandas python?

Question

I have a DataFrame as below:

index text_column

   0   ,(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
   1   ,Info_concern,Info_concern
   2   ,color_Concern,color_Concern,no_category
   3   ,reg_Concern,reg_Concern

I am trying to remove duplicates including the source value completely within each row.

I tried this:

df['result'] = [set(x) for x in df['text_column']]

This removes the duplicates but keeps one copy of each duplicated value. I need that remaining copy (the source value) to be removed as well.

Desired output:

     result
    
  0  (concern_code),(concern_color)
  1  
  2  no_category
  3

Any suggestions or advice?

SeaBean · Accepted Answer · 2021-09-22 23:17:27Z


Version 1: Removing duplicates across all rows:

You can use .drop_duplicates() with the parameter keep=False after splitting the strings with .str.split() and expanding the substrings into separate rows with .explode().

Then regroup the entries into their original rows with .groupby() on the row index (level 0). Finally, join the remaining substrings of each original row back together with .agg() and ','.join.

df['result'] = (df['text_column'].str.split(',')
                                 .explode()
                                 .drop_duplicates(keep=False)
                                 .groupby(level=0).agg(','.join)
               )

.drop_duplicates() with keep=False drops every occurrence of a duplicated value, including the first one (the source value).
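To make the effect of keep=False concrete, here is a minimal sketch on a small made-up Series (the values "a", "b", "c" are illustrative and not from the question's data). It compares the default behaviour, which keeps the first occurrence, with keep=False.

```python
import pandas as pd

# Illustrative values only; "a" appears twice.
s = pd.Series(["a", "b", "a", "c"])

# Default keep='first': the first copy of each duplicated value survives.
first_kept = s.drop_duplicates().tolist()

# keep=False: every value that occurs more than once is dropped entirely.
none_kept = s.drop_duplicates(keep=False).tolist()

print(first_kept)  # ['a', 'b', 'c']
print(none_kept)   # ['b', 'c']
```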

Alternatively, you can also do it with .stack() in place of .explode(), as follows:

df['result'] = (df['text_column'].str.split(',', expand=True)
                                 .stack()
                                 .drop_duplicates(keep=False)
                                 .groupby(level=0).agg(','.join)
               )

Data Input:

(Extra test cases added to the sample data from the question:)

                                                      text_column
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1                                       Info_concern,Info_concern
2                         color_Concern,color_Concern,no_category
3                                         reg_Concern,reg_Concern
4                                                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL

Result:

print(df)

                                                      text_column                          result
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)  (concern_code),(concern_color)
1                                       Info_concern,Info_concern                             NaN
2                         color_Concern,color_Concern,no_category                     no_category
3                                         reg_Concern,reg_Concern                             NaN
4                                                    ABCDEFGHIJKL                             NaN
5                                                    ABCDEFGHIJKL                             NaN

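Note that the last 2 rows, which hold the same string, are removed as duplicates of each other even though they are in different rows. Rows left with no substrings show NaN because they no longer appear in the grouped result.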
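For anyone who wants to reproduce the cross-row behaviour, the following sketch rebuilds the test data shown above and applies the .explode() variant of Version 1. The intermediate name `kept` is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "text_column": [
        "(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)",
        "Info_concern,Info_concern",
        "color_Concern,color_Concern,no_category",
        "reg_Concern,reg_Concern",
        "ABCDEFGHIJKL",
        "ABCDEFGHIJKL",
    ]
})

# Duplicates are dropped over the whole exploded column, so a string
# repeated in two different rows (rows 4 and 5) is removed from both.
kept = (df["text_column"].str.split(",")
                         .explode()
                         .drop_duplicates(keep=False)
                         .groupby(level=0).agg(",".join))

# Only rows 0 and 2 survive in `kept`; index alignment on assignment
# fills the other rows with NaN.
df["result"] = kept
print(df["result"])
```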

Version 2: Removing duplicates within the same row only:

If the scope of removing duplicates is limited to only within the same row rather than across all rows, we can achieve this by the following code variation:

df['result'] = (df['text_column'].str.split(',', expand=True)
                                 .stack()
                                 .groupby(level=0)
                                 .agg(lambda x: ','.join(x.drop_duplicates(keep=False)))
               )

Data Input:

(Extra test cases added to the sample data from the question:)

                                                      text_column
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)
1                                       Info_concern,Info_concern
2                         color_Concern,color_Concern,no_category
3                                         reg_Concern,reg_Concern
4                                                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL

Output:

print(df)

                                                      text_column                          result
0  (Unable_to_see),(concern_code),(concern_color),(Unable_to_see)  (concern_code),(concern_color)
1                                       Info_concern,Info_concern                                
2                         color_Concern,color_Concern,no_category                     no_category
3                                         reg_Concern,reg_Concern                                
4                                                    ABCDEFGHIJKL                    ABCDEFGHIJKL
5                                                    ABCDEFGHIJKL                    ABCDEFGHIJKL

Note that the last 2 rows, which hold the same string, are both kept because they are in different rows.
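The within-row version can be reproduced with the sketch below, which uses .explode() in place of .stack() and then cross-checks the result against a plain-Python alternative. The helper `drop_repeated` and the column `result_plain` are hypothetical names added for this illustration. The helper counts substrings with collections.Counter, which is a different technique from the answer's and is shown only as a sanity check.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "text_column": [
        "(Unable_to_see),(concern_code),(concern_color),(Unable_to_see)",
        "Info_concern,Info_concern",
        "color_Concern,color_Concern,no_category",
        "reg_Concern,reg_Concern",
        "ABCDEFGHIJKL",
        "ABCDEFGHIJKL",
    ]
})

# Version 2 logic: drop duplicates inside each row's group only.
df["result"] = (df["text_column"].str.split(",")
                                 .explode()
                                 .groupby(level=0)
                                 .agg(lambda x: ",".join(x.drop_duplicates(keep=False))))


def drop_repeated(text):
    """Hypothetical helper: keep only substrings that occur exactly once."""
    parts = text.split(",")
    counts = Counter(parts)
    return ",".join(p for p in parts if counts[p] == 1)


# Plain-Python cross-check, applied row by row.
df["result_plain"] = df["text_column"].apply(drop_repeated)
print(df[["result", "result_plain"]])
```

Rows whose substrings are all duplicated end up as empty strings here rather than NaN, because every row still forms a group.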

edited Sep 22, 2021 at 23:17

answered Sep 22, 2021 at 21:49

SeaBean


13 Comments

code_learner Over a year ago

Is there a way to implement this without explode()?

SeaBean Over a year ago

@code_learner Any concern with explode()? It is the natural way to work with the split strings one by one. Without it, we would need a much more complicated approach, which may not actually save much.

SeaBean Over a year ago

@code_learner See my edit above. Revised without using .explode().

code_learner Over a year ago

I think stack() works for me the way it is expected to; really not sure why explode() won't work though. Appreciate your insights on this.

code_learner Over a year ago

Yes, dropping duplicates was to be done within the same line only. With the previous solution, removing duplicates across the lines was actually giving me NaNs mostly. The updated solution works perfectly. Thanks for clarifying the difference.
