0

In the following code I would like to identify and report values in Col1 that appear in Col2, values in Col2 that appear in Col1 and overall values that appear more than once.

In the example below values AAPL and GOOG appear in Col1 and Col2. These are expected to be identified and reported in next 2 columns, and in the column after that expecting to identify and report whether "any" of Col1 or Col2 values are DUP.

import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print (df)

# How to code after this to produce expected result?
# Appreciate any hint/help provided

This is how the result will appear in Excel

5
  • Something like this: df['Col1inCol2']=np.where(df.Col1.isin(df.Col2), 'True','False'). do you want to account for NaNs as well? Commented Jul 5, 2018 at 3:41
  • Yes. np.nan are not expected to count as DUPs. See the expected result image in Excel Commented Jul 5, 2018 at 3:47
  • The first cell of col2_vals_exist_in_col1 says False and why is that? Commented Jul 5, 2018 at 3:56
  • Dark: There is error in the image. It should be "True" instead of "False". Commented Jul 5, 2018 at 4:30
  • was caught up; see my reply below. Commented Jul 5, 2018 at 5:50

4 Answers 4

1

Here is a solution for you that works with the code above. It just uses some for loops with itterows(). Nothing fancy.

df['Col3'] = False
df['Col4'] = False
df['Col5'] = False

for i,row in df.iterrows():
  if df.loc[i,'Col1'] in (df.Col2.values):
     df.loc[i,'Col3'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col2'] in (df.Col1.values):
     df.loc[i,'Col4'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col3'] | df.loc[i,'Col4'] == True:
     df.loc[i,'Col5'] = True

Click here to view image of result

Sign up to request clarification or add additional context in comments.

Comments

1

Use numpy where to check if one column value is in another, and then boolean OR the columns to check if it's a dupe.

df['Col1inCol2']=np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2inCol1']=np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe']= df.Col1inCol2 | df.Col2inCol1



    Col1    Col2    Col1inCol2  Col2inCol1  Dupe
0   AAPL    GOOG    True            True    True
1   NaN     IBM     False           False   False
2   GOOG    MSFT    True            False   True
3   MMM     NaN     False           False   False
4   NaN     GOOG    False           True    True
5   INTC    AAPL    False           True    True
6   FB       VZ     False           False   False

2 Comments

Thanks "skrubber". The solution works. And its even less code than solution earlier !!
Guess you need to click on the Green Tick beside the answer, and vote as well
0

Following is the final script:

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 04-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################

import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("Initial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)


df['Col1_val_exists_in_Col2'] = False
df['Col2_val_exists_in_Col1'] = False
df['Dup_in_Frame'] = False

for i,row in df.iterrows():
  if df.loc[i,'Col1'] in (df.Col2.values):
     df.loc[i,'Col1_val_exists_in_Col2'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col2'] in (df.Col1.values):
     df.loc[i,'Col2_val_exists_in_Col1'] = True

for i,row in df.iterrows():
  if df.loc[i,'Col1_val_exists_in_Col2'] | df.loc[i,'Col2_val_exists_in_Col1'] == True:
     df.loc[i,'Dup_in_Frame'] = True

print ("Final DataFrame\n")
print (df)

Comments

0

Another way of doing the task is given below - thanks to "skrubber":

##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 05-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################

import pandas as pd
import numpy as np
data={ 
       'Col1':
              ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
       'Col2':
              ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']
     }
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("\n\nInitial DataFrame\n")
print (df)

pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)

df['Col1_val_exists_in_Col2'] = np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2_val_exists_in_Col1'] = np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe'] = df.Col1_val_exists_in_Col2 | df.Col2_val_exists_in_Col1


print ("\n\nFinal DataFrame\n")
print (df)


Initial DataFrame

   Col1  Col2
0  AAPL  GOOG
1   NaN   IBM
2  GOOG  MSFT
3   MMM   NaN
4   NaN  GOOG
5  INTC  AAPL
6    FB    VZ


Final DataFrame

   Col1  Col2  Col1_val_exists_in_Col2  Col2_val_exists_in_Col1   Dupe
0  AAPL  GOOG                     True                     True   True
1   NaN   IBM                    False                    False  False
2  GOOG  MSFT                     True                    False   True
3   MMM   NaN                    False                    False  False
4   NaN  GOOG                    False                     True   True
5  INTC  AAPL                    False                     True   True
6    FB    VZ                    False                    False  False

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.