Quickest way to find partial string match between two pandas dataframes

Question

I have two location-based pandas DataFrames.

df1: Which has a column that consists of a full address, such as "Avon Road, Ealing, London, UK". The address varies in format.

df1.address[0] --> "Avon Road, Ealing, London, UK"

df2: Which just has cities of UK, such as "London".

df2.city[5] --> "London"

I want to locate the city of the first dataframe, given the full address. This would go on my first dataframe as such.

df1.city[0] --> "London"

Approach 1: For each city in df2, check if df1 has those cities and stores the indexs of df1 and the city of df2 in a list.

I am not certain how i would go about doing this, but I assume i would use this code to figure out if there is a partial string match and locate the index's:

df1['address'].str.contains("London",na=False).index.values

Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.

I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.

Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)

for the second approah, see this answer with str.extract, in which L would be for you L=df2.city.tolist() or something like this — Ben.T
– Ben.T, Commented Jun 21, 2020 at 20:07
@JérômeRichard df1 contains around 300,000 rows (growing) and df2 contains 240 rows (fixed). — Meruem
– Meruem, Commented Jun 21, 2020 at 20:24

Jérôme Richard · Accepted Answer · 2020-06-21 21:23:42Z

2

Approach 2 is indeed a good start. However, using a Python dictionary rather than a list should be much faster. Here is an example code:

cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Warning: ignore characters like '-' in the cities
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations

answered Jun 21, 2020 at 21:23

Jérôme Richard

53.4k6 gold badges48 silver badges77 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Quickest way to find partial string match between two pandas dataframes

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related