2

I have two location-based pandas DataFrames.

df1: Which has a column that consists of a full address, such as "Avon Road, Ealing, London, UK". The address varies in format.

df1.address[0] --> "Avon Road, Ealing, London, UK"

df2: Which just has cities of UK, such as "London".

df2.city[5] --> "London"

I want to locate the city of the first dataframe, given the full address. This would go on my first dataframe as such.

df1.city[0] --> "London"

Approach 1: For each city in df2, check if df1 has those cities and stores the indexs of df1 and the city of df2 in a list.

I am not certain how i would go about doing this, but I assume i would use this code to figure out if there is a partial string match and locate the index's:

df1['address'].str.contains("London",na=False).index.values  

Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.

I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.

Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)

4
  • See stackoverflow.com/questions/54756025/… Commented Jun 21, 2020 at 20:00
  • for the second approah, see this answer with str.extract, in which L would be for you L=df2.city.tolist() or something like this Commented Jun 21, 2020 at 20:07
  • How many rows df1 and df2 contains approximately? Commented Jun 21, 2020 at 20:09
  • @JérômeRichard df1 contains around 300,000 rows (growing) and df2 contains 240 rows (fixed). Commented Jun 21, 2020 at 20:24

1 Answer 1

2

Approach 2 is indeed a good start. However, using a Python dictionary rather than a list should be much faster. Here is an example code:

cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Warning: ignore characters like '-' in the cities
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.