How to extract string from dataframe after regex matching

Question

Want to extract city name from the address which appear after zip code from pandas dataframe. Given: 10 rue des Treuils BP 12 33023, Bordeaux France I want to extract Bordeaux from column of dataframe.

City name is always first after the comma , but it is not guaranteed to be one word. Need to strip off country name which will be a fixed string like France , Italy etc.

More examples of french city names

Les Deux Alpes
Val dIsere

Can you provide more specific details? For example, can you assume that the city name is always the first word after the final comma? Also, what have you tried so far? Is regex a requirement or preference? — jpp
– jpp, Commented May 12, 2018 at 0:06
How would you handle "..., New York United States"? Is there a fixed list of all country names? — DYZ
– DYZ, Commented May 12, 2018 at 0:10
United States will be fixed string which can be stripped off as on exact match? — user1411335
– user1411335, Commented May 12, 2018 at 0:11
please post the list of fixed countries in pastebin.com and share the link so we can help you further. — Pedro Lobito
– Pedro Lobito, Commented May 12, 2018 at 0:42
Yes, please. A properly anchored regex could look like this (?<=\d{5}, ).*(?=France|United States) — wp78de
– wp78de, Commented May 12, 2018 at 3:11

Pedro Lobito · Accepted Answer · 2018-05-17 03:36:29Z

United States will be fixed string which can be stripped off as on exact match

My solution is to remove the country name, which will leave us with the city name only.
This approach seems to be easier since country names are fixed and can be easily removed based on a list, i.e.:

split() the address in two based on comma (,);
replace() the country name with nothing;
Use panda's apply() to apply get_city() function containing the steps above.
Use panda's tolist() to convert column City to a list. This last step is optional, as it depends on what you'll do with the city names.

i.e.:

import pandas as pd
addresses = [['10 rue des Treuils BP 12 33023, Bordeaux France'],['Rua da Alegria 22, Lisboa Portugal'],['22 Some Street, NYC United States']]
df = pd.DataFrame(addresses,columns=['Address'])

countries = ['Portugal', 'France', 'United States']

def get_city(address):
    city_country = address.split(",")[1]
    for i in countries: city = city_country.replace(i, "")
    return city.strip()

df['City'] = df['Address'].apply(get_city)
print (df['City'].tolist())

Output:

['Bordeaux', 'Lisboa', 'NYC']

PS: You may want to lower() both the addresses and countries list in order to avoid case SenSitIve mismatches.

Ryo Shiina · Accepted Answer · 2018-05-12 00:38:47Z

0

If we consider your regex to be working with French addresses (ending by France), then you can use this :

/,\s([A-Z][A-Za-z\s-]+)\sFrance/gm

Link to the online regex simulator where I tested the expression

You mentioned earlier about the United States, but actually the way adresses are written is totally different, so you'll have to make another regex for it, I guess. (i.e: 4 Cross Lane Schererville, IN 46375)

edited May 12, 2018 at 0:38

answered May 12, 2018 at 0:32

Ryo Shiina

5784 silver badges13 bronze badges

Comments

Anton vBR · Accepted Answer · 2018-05-12 00:55:40Z

Yeah maybe some advanced regex could handle this but the pandas naive approach would be:

import pandas as pd
import numpy as np

col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
                 '10 rue des Treuils BP 12 33023, Les Deux Alpes France',
                 '10 rue des Treuils BP 12 33023, New York United States'])

cities = np.where(col.str.endswith('United States'), 
                  col.str.split(', ').str[1].str.split().str[:-2].str.join(' '), 
                  col.str.split(', ').str[1].str.split().str[:-1].str.join(' '))

print(cities)
#['Bordeaux' 'Les Deux Alpes' 'New York']

A more general but not as effective solution (but who needs speed right?)

import pandas as pd

col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
                 '10 rue des Treuils BP 12 33023, New York United States',
                 '10 rue des Treuils BP 12 33023, Seoul South Korea',
                 '10 rue des Treuils BP 12 33023, Brazzaville Republic of Congo'])

countries = {'United States': 2 , 'South Korea': 2, 'Republic of Congo': 3}
n = [next((countries[k] for k,v in countries.items() if i.endswith(k)), 1) for i in col]
cities = [' '.join(i.split(', ')[1].split()[:-y]) for i,y in zip(col,n)]

print(cities)
# ['Bordeaux', 'Les Deux Alpes', 'New York', 'Seoul', 'Brazzaville']

And then simply assign back with:

df['city'] = cities

Collectives™ on Stack Overflow

How to extract string from dataframe after regex matching

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related