Let's say I have two dataframes containing city names in different formats. I want to match their rows on the state and on the first four characters of each city name. A small example is as follows:

import pandas as pd
df1 = pd.DataFrame({'city': ['NEW YORK', 'DALLAS', 'LOS ANGELES', 'SAN FRANCISCO'],
                   'state' : ['NY', 'TX', 'CA', 'CA'],
                   'value' : [1,2,3,4]})
df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'DALLAS/ABC', 'LOS ANG', 'ABC'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                   'temp': [20,21,21,23]})
df1
            city state  value
0       NEW YORK    NY      1
1         DALLAS    TX      2
2    LOS ANGELES    CA      3
3  SAN FRANCISCO    CA      4

df2
            city state  temp
0  NEW YORK CITY    NY    20
1     DALLAS/ABC    TX    21
2        LOS ANG    CA    21
3            ABC    CA    23

What I want is a dataframe as follows:

       city state  temp  values
0  NEW YORK    NY    20       1
1    DALLAS    TX    21       2
2   LOS ANG    CA    21       3

I cannot use isin(), since the city names do not match exactly. So far, I am thinking of using str.contains, but I cannot think of an efficient way to do this.

Help is greatly appreciated.

2 Answers

Create a temporary city4 column holding the first four characters of each city name, then merge on city4 and state:

In [5247]: pd.merge(df1.assign(city4=df1.city.str[:4]),
                    df2.assign(city4=df2.city.str[:4]), 
                    on=['city4', 'state']).drop(columns='city4')
Out[5247]:
        city_x state  value         city_y  temp
0     NEW YORK    NY      1  NEW YORK CITY    20
1       DALLAS    TX      2     DALLAS/ABC    21
2  LOS ANGELES    CA      3        LOS ANG    21

To get the requested columns, also drop city_y and rename city_x to city:

In [5251]: (pd.merge(df1.assign(city4=df1.city.str[:4]),
      ...:           df2.assign(city4=df2.city.str[:4]),
      ...:           on=['city4', 'state'])
      ...:    .drop(columns=['city4', 'city_y'])
      ...:    .rename(columns={'city_x': 'city'}))
Out[5251]:
          city state  value  temp
0     NEW YORK    NY      1    20
1       DALLAS    TX      2    21
2  LOS ANGELES    CA      3    21
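
The same merge can be written as a self-contained script. This is a minimal sketch rather than part of the original answer: the helper name merge_on_city_prefix, its n parameter, and the temporary column name city_prefix are hypothetical. It passes columns= by keyword to drop, since pandas 2.0 and later no longer accept the axis as a positional argument.

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['NEW YORK', 'DALLAS', 'LOS ANGELES', 'SAN FRANCISCO'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'DALLAS/ABC', 'LOS ANG', 'ABC'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'temp': [20, 21, 21, 23]})


def merge_on_city_prefix(left, right, n=4):
    """Inner-join two frames on state and the first n characters of city.

    Hypothetical helper for illustration; keeps the city names from `left`.
    """
    key = 'city_prefix'  # temporary join column, dropped before returning
    merged = pd.merge(left.assign(**{key: left['city'].str[:n]}),
                      right.assign(**{key: right['city'].str[:n]}),
                      on=[key, 'state'],
                      suffixes=('', '_right'))  # left 'city' keeps its name
    return merged.drop(columns=[key, 'city_right'])


result = merge_on_city_prefix(df1, df2)
print(result)
```

Using an empty suffix for the left frame avoids the separate rename step, at the cost of being slightly less explicit about which city column survives.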

Details

In [5255]: df1.assign(city4=df1.city.str[:4])
Out[5255]:
            city state  value city4
0       NEW YORK    NY      1  NEW
1         DALLAS    TX      2  DALL
2    LOS ANGELES    CA      3  LOS
3  SAN FRANCISCO    CA      4  SAN

In [5256]: df2.assign(city4=df2.city.str[:4])
Out[5256]:
            city state  temp city4
0  NEW YORK CITY    NY    20  NEW
1     DALLAS/ABC    TX    21  DALL
2        LOS ANG    CA    21  LOS
3            ABC    CA    23   ABC
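
One detail the printed frames hide: slicing with str[:4] keeps spaces, so the key for NEW YORK is 'NEW ' with a trailing space, and a name shorter than four characters is kept whole. A small check, using only city names from the question, makes this visible:

```python
import pandas as pd

cities = pd.Series(['NEW YORK', 'DALLAS', 'LOS ANGELES', 'ABC'])
prefixes = cities.str[:4]

# repr() shows the trailing space that the DataFrame display hides
print([repr(p) for p in prefixes])
```

Both frames are sliced the same way, so the keys still line up in the merge.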

Another way is to use map, building lookup keys from the state plus the first four letters of the city:

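# Keys: state code + first four characters of the city name
one = df1.state + df1.city.str[:4]
two = df2.state + df2.city.str[:4]
# Rows of df1 whose key is missing from df2 get NaN, which dropna() removes
df1['temp'] = one.map(df2.set_index(two)['temp'].to_dict())
df1 = df1.dropna()

df1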
          city state  value  temp
0     NEW YORK    NY      1  20.0
1       DALLAS    TX      2  21.0
2  LOS ANGELES    CA      3  21.0
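
A caveat with the map approach, offered as a sketch rather than as part of the original answer: to_dict() keeps only the last value when two rows of df2 produce the same key, so duplicates are resolved silently. The 'LOS ALAMITOS' row below is invented purely to trigger such a collision. The check uses Series.duplicated to surface it before mapping.

```python
import pandas as pd

df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'LOS ANG', 'LOS ALAMITOS'],
                    'state': ['NY', 'CA', 'CA'],
                    'temp': [20, 21, 25]})  # last row is made up for the demo

two = df2.state + df2.city.str[:4]

# keep=False marks every member of a duplicated group, not just the repeats
clashes = df2[two.duplicated(keep=False)]
print(clashes)

# The dict built for map() silently keeps the last temp for the shared key
lookup = df2.set_index(two)['temp'].to_dict()
print(lookup['CALOS '])
```

If clashes is non-empty, it may be worth deciding explicitly which row should win (or aggregating temp) before building the lookup.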
