I want to check that the categories in one dataframe column match the categories in another, i.e. that there are no mismatches in spelling etc.

I now have two arrays holding the unique values of the columns of interest. I would like to return any values that are in the first, smaller array but not in the second, larger array, so that I can narrow down the categories I may need to adjust or re-spell. I believe I should use a for loop to compare the arrays, but I am struggling with the implementation. Example code below, thanks:

borough_pm25 = pm25['Borough_x'].unique()
borough_pm25
array(['Barnet', 'Camden', 'Wandsworth', 'Hounslow', 'Southwark',
       'Westminster', 'Kensington & Chelsea', 'Tower Hamlets',
       'Islington', 'Kingston', 'Barking & Dagenham', 'Waltham Forest',
       'Haringey', 'Lambeth', 'Enfield', 'Greenwich', 'Redbridge',
       'Newham', 'City of London', 'Hackney', 'Richmond', 'Ealing',
       'Hammersmith & Fulham', 'Lewisham', 'Sutton', 'Havering', 'Bexley',
       'Bromley'], dtype=object)

borough_map = map_df['NAME'].unique()
borough_map
array(['Kingston upon Thames', 'Croydon', 'Bromley', 'Hounslow', 'Ealing',
       'Havering', 'Hillingdon', 'Harrow', 'Brent', 'Barnet', 'Lambeth',
       'Southwark', 'Lewisham', 'Greenwich', 'Bexley', 'Enfield',
       'Waltham Forest', 'Redbridge', 'Sutton', 'Richmond upon Thames',
       'Merton', 'Wandsworth', 'Hammersmith and Fulham',
       'Kensington and Chelsea', 'Westminster', 'Camden', 'Tower Hamlets',
       'Islington', 'Hackney', 'Haringey', 'Newham',
       'Barking and Dagenham', 'City of London'], dtype=object)
  • Thanks Mihai, yes this works in the sense that it returns False, i.e. there is a mismatch; however, I need to return the actual values that do not match. Commented Feb 8, 2020 at 18:00

1 Answer

You can use set operations: convert each array to a set and take the difference, which needs no explicit for loop.

import numpy as np
a=np.array(['Barnet', 'Camden', 'Wandsworth', 'Hounslow', 'Southwark',
       'Westminster', 'Kensington & Chelsea', 'Tower Hamlets',
       'Islington', 'Kingston', 'Barking & Dagenham', 'Waltham Forest',
       'Haringey', 'Lambeth', 'Enfield', 'Greenwich', 'Redbridge',
       'Newham', 'City of London', 'Hackney', 'Richmond', 'Ealing',
       'Hammersmith & Fulham', 'Lewisham', 'Sutton', 'Havering', 'Bexley',
       'Bromley'])
b=np.array(['Kingston upon Thames', 'Croydon', 'Bromley', 'Hounslow', 'Ealing',
       'Havering', 'Hillingdon', 'Harrow', 'Brent', 'Barnet', 'Lambeth',
       'Southwark', 'Lewisham', 'Greenwich', 'Bexley', 'Enfield',
       'Waltham Forest', 'Redbridge', 'Sutton', 'Richmond upon Thames',
       'Merton', 'Wandsworth', 'Hammersmith and Fulham',
       'Kensington and Chelsea', 'Westminster', 'Camden', 'Tower Hamlets',
       'Islington', 'Hackney', 'Haringey', 'Newham',
       'Barking and Dagenham', 'City of London'])

print(set(a) - set(b))  # elements present in a but not in b
print(set(b) - set(a))  # elements present in b but not in a
print(set(a) ^ set(b))  # symmetric difference: elements present in exactly one of a and b

{'Barking & Dagenham',
 'Hammersmith & Fulham',
 'Kensington & Chelsea',
 'Kingston',
 'Richmond'}  #set(a)-set(b)

{'Barking and Dagenham',
 'Brent',
 'Croydon',
 'Hammersmith and Fulham',
 'Harrow',
 'Hillingdon',
 'Kensington and Chelsea',
 'Kingston upon Thames',
 'Merton',
 'Richmond upon Thames'}  #set(b)-set(a)

{'Barking & Dagenham',
 'Barking and Dagenham',
 'Brent',
 'Croydon',
 'Hammersmith & Fulham',
 'Hammersmith and Fulham',
 'Harrow',
 'Hillingdon',
 'Kensington & Chelsea',
 'Kensington and Chelsea',
 'Kingston',
 'Kingston upon Thames',
 'Merton',
 'Richmond',
 'Richmond upon Thames'}
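
If you would rather stay in NumPy, a minimal sketch of the same idea uses np.setdiff1d, which returns the sorted unique values of the first array that are absent from the second. Since the goal is to re-spell the mismatches, the sketch also pairs each unmatched name with its closest candidate using difflib.get_close_matches from the standard library. The shortened arrays and the cutoff of 0.4 are assumptions chosen for illustration, not values from the question, and the suggestions should be checked by eye before renaming anything.

```python
import difflib
import numpy as np

# Shortened, illustrative samples of the two arrays from the question.
borough_pm25 = np.array(['Barnet', 'Kingston', 'Richmond', 'Barking & Dagenham'])
borough_map = np.array(['Barnet', 'Kingston upon Thames', 'Richmond upon Thames',
                        'Barking and Dagenham', 'Croydon'])

# Values in the first array that never appear in the second (sorted, unique).
missing = np.setdiff1d(borough_pm25, borough_map)

# For each unmatched name, propose the closest spelling in the reference array.
# cutoff=0.4 is an assumed similarity threshold; an empty list means no candidate.
candidates = borough_map.tolist()
suggestions = {
    name: difflib.get_close_matches(name, candidates, n=1, cutoff=0.4)
    for name in missing.tolist()
}

for name, match in suggestions.items():
    print(name, '->', match)
```

The resulting dictionary can then be turned into a rename mapping for the dataframe column, for example with pandas' Series.replace, once the suggested pairs have been confirmed.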
1 Comment

Glad I helped you.
