I want to extract the city name, which appears after the zip code, from an address column in a pandas DataFrame. Given "10 rue des Treuils BP 12 33023, Bordeaux France", I want to extract "Bordeaux".

The city name always comes first after the comma, but it is not guaranteed to be a single word. The country name, which will be a fixed string such as France or Italy, needs to be stripped off.

More examples of French city names:

  • Les Deux Alpes

  • Val d'Isère

  • Can you provide more specific details? For example, can you assume that the city name is always the first word after the final comma? Also, what have you tried so far? Is regex a requirement or preference? Commented May 12, 2018 at 0:06
  • How would you handle "..., New York United States"? Is there a fixed list of all country names? Commented May 12, 2018 at 0:10
  • United States will be a fixed string, which can be stripped off on an exact match. Commented May 12, 2018 at 0:11
  • Please post the list of fixed countries on pastebin.com and share the link so we can help you further. Commented May 12, 2018 at 0:42
  • Yes, please. A properly anchored regex could look like this (?<=\d{5}, ).*(?=France|United States) Commented May 12, 2018 at 3:11

3 Answers

"United States will be a fixed string, which can be stripped off on an exact match"


My solution is to remove the country name, which will leave us with the city name only.
This approach seems to be easier since country names are fixed and can be easily removed based on a list, i.e.:

  1. split() the address in two on the comma (,);
  2. replace() each country name with an empty string;
  3. Use pandas' apply() to run the get_city() function, which contains the steps above, on every row;
  4. Use pandas' tolist() to convert the City column to a list. This last step is optional, as it depends on what you'll do with the city names.

For example:

import pandas as pd

addresses = [['10 rue des Treuils BP 12 33023, Bordeaux France'],
             ['Rua da Alegria 22, Lisboa Portugal'],
             ['22 Some Street, NYC United States']]
df = pd.DataFrame(addresses, columns=['Address'])

countries = ['Portugal', 'France', 'United States']

def get_city(address):
    # Everything after the comma is "<city> <country>"
    city = address.split(",")[1]
    # Remove each known country in turn, accumulating the result in `city`
    for country in countries:
        city = city.replace(country, "")
    return city.strip()

df['City'] = df['Address'].apply(get_city)
print(df['City'].tolist())

Output:

['Bordeaux', 'Lisboa', 'NYC']

PS: You may want to lower() both the addresses and countries list in order to avoid case SenSitIve mismatches.
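
The case-insensitive idea from the PS could be sketched as follows. This is only an illustration, not part of the original answer: the helper name get_city_ci is hypothetical, and it assumes every address contains a comma and that the country is the last thing in the string. It compares lowercased text but slices the original string, so the city keeps its capitalisation.

```python
countries = ['Portugal', 'France', 'United States']

def get_city_ci(address, country_names=countries):
    # Hypothetical helper: text after the first comma is "<city> <country>"
    tail = address.split(",", 1)[1].strip()
    lowered = tail.lower()
    for country in country_names:
        # Match case-insensitively, but cut the suffix off the original text.
        # Assumes lower() does not change the length of the country name.
        if lowered.endswith(country.lower()):
            return tail[:len(tail) - len(country)].strip()
    # No known country at the end: return the tail unchanged
    return tail

print(get_city_ci('10 rue des Treuils BP 12 33023, Bordeaux FRANCE'))
# Bordeaux
```

Because it only strips a country found at the very end, a city that happens to contain a country name elsewhere is left alone.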

If your regex only needs to work with French addresses (ending with France), then you can use this:

/,\s([A-Z][A-Za-z\s-]+)\sFrance/gm

Link to the online regex simulator where I tested the expression

You mentioned the United States earlier, but addresses there are written in a totally different way, so you'll have to write another regex for them, I guess (e.g. 4 Cross Lane Schererville, IN 46375).
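
In pandas, the France-only pattern above could be applied with Series.str.extract. This is a rough sketch rather than the answerer's own code: the sample addresses are illustrative, and the /.../gm delimiters are dropped because they are JavaScript-style notation, not part of a Python pattern.

```python
import pandas as pd

# Illustrative addresses; the third one is included to show a limitation
col = pd.Series([
    '10 rue des Treuils BP 12 33023, Bordeaux France',
    '10 rue des Treuils BP 12 33023, Les Deux Alpes France',
    "10 rue des Treuils BP 12 33023, Val d'Isère France",
])

# Same pattern as in the answer, without the /.../gm wrapper
pattern = r",\s([A-Z][A-Za-z\s-]+)\sFrance"

# With a single capture group, expand=False returns a Series
french_cities = col.str.extract(pattern, expand=False)
print(french_cities.tolist())
# ['Bordeaux', 'Les Deux Alpes', nan]
```

The last row comes back as NaN because the character class [A-Za-z\s-] allows neither apostrophes nor accented letters, so names like Val d'Isère would need a wider class.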

Yeah, maybe some advanced regex could handle this, but the naive pandas approach would be:

import pandas as pd
import numpy as np

col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
                 '10 rue des Treuils BP 12 33023, Les Deux Alpes France',
                 '10 rue des Treuils BP 12 33023, New York United States'])

# Words after the comma, e.g. ['Bordeaux', 'France']
words = col.str.split(', ').str[1].str.split()

# Drop the last two words for 'United States', otherwise only the last one
cities = np.where(col.str.endswith('United States'),
                  words.str[:-2].str.join(' '),
                  words.str[:-1].str.join(' '))

print(cities)
# ['Bordeaux' 'Les Deux Alpes' 'New York']

A more general, though less efficient, solution (but who needs speed, right?):

import pandas as pd

col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
                 '10 rue des Treuils BP 12 33023, New York United States',
                 '10 rue des Treuils BP 12 33023, Seoul South Korea',
                 '10 rue des Treuils BP 12 33023, Brazzaville Republic of Congo'])

# Word count of each multi-word country name; any other country counts as one word
countries = {'United States': 2, 'South Korea': 2, 'Republic of Congo': 3}
n = [next((words for name, words in countries.items() if i.endswith(name)), 1) for i in col]
# Keep the text after the comma and drop the trailing country words
cities = [' '.join(i.split(', ')[1].split()[:-y]) for i, y in zip(col, n)]

print(cities)
# ['Bordeaux', 'New York', 'Seoul', 'Brazzaville']

And then simply assign back with:

df['city'] = cities
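
If a full list of country names is available, as suggested in the comments on the question, one alternative is to build a single regex from that list instead of counting words. The sketch below is an assumption-laden illustration, not the method of this answer: the helper name extract_city is hypothetical, and it assumes each address has exactly one comma and ends with the country name.

```python
import re

# Assumed list of country names; in practice this would be the full fixed list
countries = ['France', 'United States', 'South Korea', 'Republic of Congo']

# Escape each name so it is matched literally, then join into one alternation
alternation = "|".join(re.escape(name) for name in countries)

# The lazy group (.+?) captures as little as possible after the comma,
# so the country that follows is matched in full at the end of the string
pattern = re.compile(r",\s*(.+?)\s+(?:" + alternation + r")\s*$")

def extract_city(address):
    # Hypothetical helper: returns None when no listed country ends the address
    match = pattern.search(address)
    return match.group(1) if match else None

addresses = ['10 rue des Treuils BP 12 33023, Bordeaux France',
             '10 rue des Treuils BP 12 33023, New York United States',
             '10 rue des Treuils BP 12 33023, Seoul South Korea',
             '10 rue des Treuils BP 12 33023, Brazzaville Republic of Congo']

cities = [extract_city(a) for a in addresses]
print(cities)
# ['Bordeaux', 'New York', 'Seoul', 'Brazzaville']
```

Unlike the word-count approach, a country that is missing from the list yields None here rather than silently dropping the last word, which makes gaps in the list easier to spot.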
