I've got a list of addresses in a single column, address. How would I go about parsing the phone number and restaurant category into new columns? My dataframe looks like this:

  address
0 Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 310-246-1501 Steakhouses                                                                    
1 Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis                                                                                             
2 Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 French Bistro 

and I want to get:

  address | phone_number | category
0 Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles | 310-246-1501 | Steakhouses                                                                    
1 Art's Deli 12224 Ventura Blvd. Studio City | 818-762-1221 | Delis                                                                                             
2 Bel-Air Hotel 701 Stone Canyon Rd. Bel Air | 310-472-1211 | French Bistro 

Does anybody have any suggestions?

  • Is the address always at the end, like you've shown in your example? Commented Jul 29, 2019 at 12:10
  • tried this method? the regex could e.g. be '[0-9]{3}-[0-9]{3}-[0-9]{4}' Commented Jul 29, 2019 at 12:11

2 Answers

Try using a regex with str.extract.

Ex:

import pandas as pd

df = pd.DataFrame({'address':["Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 310-246-1501 Steakhouses", 
                              "Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis",
                              "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 French Bistro"]})
# Named groups become the column names of the extracted frame.
# The captured address and category keep the whitespace next to the phone number.
df[["address", "phone_number", "category"]] = df["address"].str.extract(r"(?P<address>.*?)(?P<phone_number>\b\d{3}-\d{3}-\d{4}\b)(?P<category>.*$)")
print(df)

Output:

                                             address  phone_number  \
0  Arnie Morton's of Chicago 435 S. La Cienega Bl...  310-246-1501   
1        Art's Deli 12224 Ventura Blvd. Studio City   818-762-1221   
2        Bel-Air Hotel 701 Stone Canyon Rd. Bel Air   310-472-1211   

         category  
0     Steakhouses  
1           Delis  
2   French Bistro  

Note: This assumes the content of address is always address, phone_number, category, in that order.
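
If that assumption might not hold for every row, it can help to test the pattern on single strings first. Below is a minimal sketch using only the standard `re` module. The helper name `split_record` and the added `\s*` (which keeps the surrounding spaces out of the captured groups) are my own additions for illustration, not part of the code above.

```python
import re

# Same shape as the pattern in the answer, but \s* keeps the whitespace
# around the phone number out of the captured groups (an added assumption).
PATTERN = re.compile(
    r"(?P<address>.*?)\s*(?P<phone_number>\b\d{3}-\d{3}-\d{4}\b)\s*(?P<category>.*)$"
)


def split_record(text):
    """Return a dict with the three fields, or None if no phone number is found."""
    match = PATTERN.match(text)
    if match is None:
        return None
    # strip() removes any trailing whitespace left on the category.
    return {name: value.strip() for name, value in match.groupdict().items()}


record = split_record("Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis")
missing = split_record("Some Place 1 Main St. Nowhere Diner")
```

Here `record` holds the three fields and `missing` is `None`, so rows that break the assumption are easy to spot. The same pattern string should also work with `str.extract`, where rows without a match come back as NaN.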

Using str.extract and str.split:

  1. We extract the pattern digits, dash, digits, dash, digits for phone_number.
  2. We split on every whitespace character that is preceded by three digits and take the last piece for category. This uses a positive lookbehind, which is (?<=...) in regex.

df['phone_number'] = df['address'].str.extract(r'(\d+-\d+-\d+)')
df['category'] = df['address'].str.split(r'(?<=\d{3})\s').str[-1]

Output

                                                                                  address  phone_number       category
0  Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 310-246-1501 Steakhouses  310-246-1501    Steakhouses
1                           Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis  818-762-1221          Delis
2                   Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 French Bistro  310-472-1211  French Bistro
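
To see what the lookbehind split actually produces, here is a small sketch on one of the strings, using the standard `re` module rather than pandas. The stricter second pattern is my own variation, not something from the answer: it requires the whole phone number before the whitespace, which should avoid extra splits after street numbers.

```python
import re

text = "Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis"

# Split at every whitespace character that is preceded by three digits.
# The street number also ends in three digits, so this yields three pieces;
# only the last one is the category.
loose = re.split(r"(?<=\d{3})\s", text)
category_loose = loose[-1]

# Stricter variant (an assumption, not from the answer): require the full
# phone pattern before the whitespace, so only one split happens here.
strict = re.split(r"(?<=\d{3}-\d{3}-\d{4})\s", text)
category_strict = strict[-1]

print(loose)
print(strict)
```

Both variants give the same category for the sample rows. The loose pattern would pick the wrong piece if a category itself contained a three-digit number followed by a space.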
