Let's say I have two dataframes containing city names, but the names are formatted differently in each. I want to match rows based on the state and the first four characters of the city name. A small example is as follows:
import pandas as pd
df1 = pd.DataFrame({'city': ['NEW YORK', 'DALLAS', 'LOS ANGELES', 'SAN FRANCISCO'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'DALLAS/ABC', 'LOS ANG', 'ABC'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'temp': [20, 21, 21, 23]})
df1
            city state  value
0       NEW YORK    NY      1
1         DALLAS    TX      2
2    LOS ANGELES    CA      3
3  SAN FRANCISCO    CA      4
df2
            city state  temp
0  NEW YORK CITY    NY    20
1     DALLAS/ABC    TX    21
2        LOS ANG    CA    21
3            ABC    CA    23
What I want is a dataframe as follows:
       city state  temp  values
0  NEW YORK    NY    20       1
1    DALLAS    TX    21       2
2   LOS ANG    CA    21       3
I cannot use isin() here, since it compares whole values and the full city names do not match exactly. So far, I am thinking of using str.contains, but I cannot think of an efficient way to do this.
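To make the matching rule concrete, here is a minimal sketch of one possible approach: build a helper key from the first four characters of each city name and do an inner merge on that key plus the state. The column name `city_key` and the variables `left`, `right`, `merged` and `result` are my own placeholders, not anything from pandas itself, and I am assuming the city strings are already upper-cased and have no leading whitespace.

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['NEW YORK', 'DALLAS', 'LOS ANGELES', 'SAN FRANCISCO'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'DALLAS/ABC', 'LOS ANG', 'ABC'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'temp': [20, 21, 21, 23]})

# Hypothetical helper column: the first four characters of each city name.
left = df1.assign(city_key=df1['city'].str[:4])
right = df2.assign(city_key=df2['city'].str[:4])

# Inner merge on state plus the four-character key; only 'temp' is taken
# from df2, so the city names in the result come from df1.
merged = left.merge(right[['state', 'city_key', 'temp']],
                    on=['state', 'city_key'], how='inner')

# Drop the helper key and order the columns.
result = merged[['city', 'state', 'temp', 'value']]
print(result)
```

In this sketch the city column keeps the df1 spelling (so the third row reads LOS ANGELES rather than LOS ANG). If two cities in the same state share the same first four characters, the merge would pair every such row from one frame with every such row from the other, so this probably only works when the prefix is unique within a state.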
Help is greatly appreciated.