Python/Pandas: Identify Duplicates across Columns

Question

In the following code I would like to identify and report values in Col1 that appear in Col2, values in Col2 that appear in Col1 and overall values that appear more than once.

In the example below, the values AAPL and GOOG appear in both Col1 and Col2. I expect these to be identified and reported in the next 2 columns, and the column after that should report whether either the Col1 or the Col2 value in that row is a DUP.

import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print (df)

# How to code after this to produce expected result?
# Appreciate any hint/help provided

This is how the result will appear in Excel

Something like this: df['Col1inCol2']=np.where(df.Col1.isin(df.Col2), 'True','False'). Do you want to account for NaNs as well?
– skrubber, Commented Jul 5, 2018 at 3:41
Yes. np.nan values are not expected to count as DUPs. See the expected result image in Excel.
– Salil Gangal, Commented Jul 5, 2018 at 3:47
The first cell of col2_vals_exist_in_col1 says False. Why is that?
– Bharath M Shetty, Commented Jul 5, 2018 at 3:56
Dark: There is an error in the image. It should be "True" instead of "False".
– Salil Gangal, Commented Jul 5, 2018 at 4:30
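
On the NaN point raised in these comments: isin treats a missing value as matching another missing value, so a plain df.Col1.isin(df.Col2) flags the rows where Col1 is NaN, because Col2 also contains a NaN. The following is only a sketch using the question's data. The names naive and masked are illustrative, and dtype=object is an assumption added here so the columns hold plain Python objects regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Question's data; dtype=object is an assumption made for this sketch
df = pd.DataFrame({
    'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ'],
}, dtype=object)

# Plain isin: the NaN in Col2 makes the NaN rows of Col1 look like matches
naive = df['Col1'].isin(df['Col2'])

# Masking out missing values keeps NaN from being reported as a duplicate
masked = naive & df['Col1'].notna()

print(pd.DataFrame({'Col1': df['Col1'], 'naive': naive, 'masked': masked}))
```

Rows 1 and 4 (where Col1 is NaN) come out True in naive and False in masked, which is why an explicit null check is needed when NaNs should not count as DUPs.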

Kavi Sek · Accepted Answer · 2018-07-05 04:14:10Z

1

Here is a solution that works with the code above. It just uses a few for loops with iterrows(). Nothing fancy.

df['Col3'] = False
df['Col4'] = False
df['Col5'] = False

for i, row in df.iterrows():
    if df.loc[i, 'Col1'] in df.Col2.values:
        df.loc[i, 'Col3'] = True

for i, row in df.iterrows():
    if df.loc[i, 'Col2'] in df.Col1.values:
        df.loc[i, 'Col4'] = True

for i, row in df.iterrows():
    if df.loc[i, 'Col3'] | df.loc[i, 'Col4'] == True:
        df.loc[i, 'Col5'] = True


skrubber · Accepted Answer · 2018-07-05 05:50:00Z

1

Use np.where with isin to check whether each value in one column appears in the other, excluding nulls, and then boolean OR the two resulting columns to check if it's a dupe.

df['Col1inCol2']=np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2inCol1']=np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe']= df.Col1inCol2 | df.Col2inCol1



   Col1  Col2  Col1inCol2  Col2inCol1   Dupe
0  AAPL  GOOG        True        True   True
1   NaN   IBM       False       False  False
2  GOOG  MSFT        True       False   True
3   MMM   NaN       False       False  False
4   NaN  GOOG       False        True   True
5  INTC  AAPL       False        True   True
6    FB    VZ       False       False  False
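
Since isin already returns a boolean Series, the np.where(..., True, False) wrapper looks optional here; the mask can be assigned directly. A minimal sketch of the same approach, rebuilt from the question's data so it runs on its own (notna() is simply the inverse of isnull()):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ'],
})

# isin gives a boolean mask; notna() stops NaN from counting as a match
df['Col1inCol2'] = df['Col1'].isin(df['Col2']) & df['Col1'].notna()
df['Col2inCol1'] = df['Col2'].isin(df['Col1']) & df['Col2'].notna()

# A row is a dupe if either of its values shows up in the other column
df['Dupe'] = df['Col1inCol2'] | df['Col2inCol1']

print(df)
```

This should produce the same three columns as the table above, with plain boolean dtype.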


Salil Gangal Over a year ago

Thanks "skrubber". The solution works, and it's even less code than the earlier solution!

skrubber Over a year ago

Guess you need to click on the Green Tick beside the answer, and vote as well

Salil Gangal · Accepted Answer · 2018-07-05 04:48:17Z

Following is the final script:

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 04-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################

import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("Initial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)


df['Col1_val_exists_in_Col2'] = False
df['Col2_val_exists_in_Col1'] = False
df['Dup_in_Frame'] = False

for i, row in df.iterrows():
    if df.loc[i, 'Col1'] in df.Col2.values:
        df.loc[i, 'Col1_val_exists_in_Col2'] = True

for i, row in df.iterrows():
    if df.loc[i, 'Col2'] in df.Col1.values:
        df.loc[i, 'Col2_val_exists_in_Col1'] = True

for i, row in df.iterrows():
    if df.loc[i, 'Col1_val_exists_in_Col2'] | df.loc[i, 'Col2_val_exists_in_Col1'] == True:
        df.loc[i, 'Dup_in_Frame'] = True

print ("Final DataFrame\n")
print (df)

Salil Gangal · Accepted Answer · 2018-07-05 12:48:40Z

Another way of doing the task is given below - thanks to "skrubber":

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 05-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################

import pandas as pd
import numpy as np
data={ 
       'Col1':
              ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
       'Col2':
              ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']
     }
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("\n\nInitial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)

df['Col1_val_exists_in_Col2'] = np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2_val_exists_in_Col1'] = np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe'] = df.Col1_val_exists_in_Col2 | df.Col2_val_exists_in_Col1


print ("\n\nFinal DataFrame\n")
print (df)


Initial DataFrame

   Col1  Col2
0  AAPL  GOOG
1   NaN   IBM
2  GOOG  MSFT
3   MMM   NaN
4   NaN  GOOG
5  INTC  AAPL
6    FB    VZ


Final DataFrame

   Col1  Col2  Col1_val_exists_in_Col2  Col2_val_exists_in_Col1   Dupe
0  AAPL  GOOG                     True                     True   True
1   NaN   IBM                    False                    False  False
2  GOOG  MSFT                     True                    False   True
3   MMM   NaN                    False                    False  False
4   NaN  GOOG                    False                     True   True
5  INTC  AAPL                    False                     True   True
6    FB    VZ                    False                    False  False
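
The question also mentions "overall values that appear more than once", which the per-row flags above do not list directly. One possible reading, offered as an assumption rather than something the answers here confirm, is to pool both columns, drop the NaNs, and count occurrences. The names pooled, counts and repeated are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2': ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ'],
})

# Pool both columns into one Series and ignore missing values
pooled = pd.concat([df['Col1'], df['Col2']], ignore_index=True).dropna()

# Count how often each value occurs across the whole frame
counts = pooled.value_counts()

# Keep only the values seen more than once, in alphabetical order
repeated = sorted(counts[counts > 1].index)

print(repeated)
```

Note that this reading also counts a value repeated within a single column, so it can differ from the Dupe flag, which only looks across columns.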
