2

I need some help splitting a field with a regex in pandas and building a DataFrame from the result.

A B C
1129 19-APR-2021 Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 City: Anchorage_Alaska , Zip: 99506 , 501thru524
1139 20-APR-2021 Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 City: Atlanta_Georgia , Zip: 30301 , 301thru381

Column C contains multiple City and Zip Code entries per row; these need to be extracted and normalized into the format below:

No    Date         City                Zip
1129  19-APR-2021  Huntsville_Alabama  35808
1129  19-APR-2021  Anchorage_Alaska    99506
1139  20-APR-2021  Miami_Florida       33128
1139  20-APR-2021  Atlanta_Georgia     30301

My re.findall expressions are below and work fine:

city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"    (https://regex101.com/r/VM8oFF/1)
zip_regex_extract = r"[0-9]{5}"                            (https://regex101.com/r/oBYJZX/1)

Below is my code so far; however, I am unable to add the Zip field to the result.

import pandas as pd
import json, re, sys, time


df = pd.DataFrame({
   'No': ['1129', '1139'],
   'Date': ['19-APR-2021','20-APR-2021'],
   'C': ['Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  City: Anchorage_Alaska , Zip: 99506 , 501thru524','Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190  City: Atlanta_Georgia , Zip: 30301 , 301thru381'] 
})


city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"
zip_regex_extract = r"[0-9]{5}"


df['City'] =  [re.findall(city_regex_extract, str(x)) for x in df['C']]
df['Zip'] =  [re.findall(zip_regex_extract, str(x)) for x in df['C']]

df = (df
.set_index(['No','Date'])['City']
.apply(pd.Series)
.stack()
.reset_index()
.drop('level_2', axis=1)
.rename(columns={0:'City'}))

print(df)

Appreciate any help.

3 Answers

3

Series.str.extractall

s = df['C'].str.extractall(r'City:\s*(?P<City>[^,]+?)\s*,\s*Zip:\s*(?P<Zip>\d+)')
df[['No', 'Date']].join(s.droplevel(1))

     No         Date                City    Zip
0  1129  19-APR-2021  Huntsville_Alabama  35808
0  1129  19-APR-2021    Anchorage_Alaska  99506
1  1139  20-APR-2021       Miami_Florida  33128
1  1139  20-APR-2021     Atlanta_Georgia  30301

Regex details:

  • City: : Matches the characters City: literally
  • \s* : Matches zero or more whitespace characters
  • (?P<City>[^,]+?): First named capturing group
    • [^,]+?: Matches any character except , one or more times, but as few times as possible
  • \s*,\s* : Matches zero or more whitespace characters, then a comma, then zero or more whitespace characters
  • Zip: : Matches the characters Zip: literally
  • \s* : Matches zero or more whitespace characters
  • (?P<Zip>\d+): Second named capturing group
    • \d+: Matches a digit one or more times

See the online regex demo
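
The breakdown above can be checked outside pandas with the standard re module. This is only a minimal sketch: the names pattern, text and pairs are introduced here for illustration, and text is the first sample row from the question.

```python
import re

# Same pattern as in the answer, with the two named groups City and Zip
pattern = re.compile(r'City:\s*(?P<City>[^,]+?)\s*,\s*Zip:\s*(?P<Zip>\d+)')

# First sample row from the question
text = ('Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  '
        'City: Anchorage_Alaska , Zip: 99506 , 501thru524')

# finditer yields one match per "City: ... , Zip: ..." occurrence
pairs = [(m.group('City'), m.group('Zip')) for m in pattern.finditer(text)]
print(pairs)
# [('Huntsville_Alabama', '35808'), ('Anchorage_Alaska', '99506')]
```

Each match corresponds to one row that extractall would produce for this input row; the trailing range (for example 801thru816) is skipped because nothing in the pattern consumes it.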

4 Comments

Thanks, Shubham. However, I am unable to get the above output. Not sure if I missed something; below is the output: No Date C 0 1129 19-APR-2021 Zip Code Details: City: Huntsville_Alabama , Z... 1 1139 20-APR-2021 Zip Code Details: City: Miami_Florida , Zip: 3...
How many rows do you get with the above approach: 2 or 4?
I see only 2 records, without any extraction of City or Zip.
df[['No', 'Date']].join(s.droplevel(1)) is not an in-place operation; you have to assign the result to a variable. For example, out = df[['No', 'Date']].join(s.droplevel(1)), then check the value of out.
1

In my opinion you don't even need the re library: pandas string methods support regex themselves, so you can split on the delimiters directly:

df['C'] = df['C'].str.split(' City: ').str[1:]
df = df.explode('C')
df[['City','Zip']] = df['C'].str.split(' , Zip: | , ', expand=True).iloc[:,:2]

print(df)

     No         Date                City    Zip
0  1129  19-APR-2021  Huntsville_Alabama  35808
0  1129  19-APR-2021    Anchorage_Alaska  99506
1  1139  20-APR-2021       Miami_Florida  33128
1  1139  20-APR-2021     Atlanta_Georgia  30301

The expand=True parameter returns the split pieces as separate columns. The .iloc[] call then selects which of those columns to keep after the split.
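
To make the roles of expand=True and .iloc[] concrete, here is a small sketch on its own. The Series parts is a hypothetical stand-in for the fragments left after splitting on ' City: ', and pieces and out are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical fragments, shaped like the pieces left after splitting on ' City: '
parts = pd.Series([
    'Huntsville_Alabama , Zip: 35808 , 801thru816',
    'Anchorage_Alaska , Zip: 99506 , 501thru524',
])

# A multi-character pattern is treated as a regex by default, so the
# alternation splits on either ' , Zip: ' or ' , '.
# expand=True returns a DataFrame with one column per piece (columns 0, 1, 2).
pieces = parts.str.split(' , Zip: | , ', expand=True)

# Keep only the first two columns (city and zip) and give them names
out = pieces.iloc[:, :2].rename(columns={0: 'City', 1: 'Zip'})
print(out)
```

Without expand=True the split would return a Series of lists, and there would be no columns for .iloc[:, :2] to select from.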

2 Comments

Thank you, Andreas. However, I am unable to get 4 records; only Anchorage_Alaska and Atlanta_Georgia are retrieved.
@pats4u oh, you were right, fixed it. Sorry about the confusion.
1

Try .explode() on City and on Zip, each followed by reset_index(), and finally join the two exploded results on the index:

df.explode('City').reset_index()[['No', 'Date', 'City']]\
    .join(df.explode('Zip').reset_index()[['Zip']])
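
An end-to-end sketch of this approach, assuming df still holds the list-valued City and Zip columns built with re.findall in the question (that is, before the set_index chain overwrites df). The names cities, zips and out are introduced here for illustration, and the .strip() call is an addition to drop the leading space that the question's city pattern captures.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'No': ['1129', '1139'],
    'Date': ['19-APR-2021', '20-APR-2021'],
    'C': [
        'Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  '
        'City: Anchorage_Alaska , Zip: 99506 , 501thru524',
        'Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190  '
        'City: Atlanta_Georgia , Zip: 30301 , 301thru381',
    ],
})

# Patterns from the question; each findall returns a list per row
city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"
zip_regex_extract = r"[0-9]{5}"
df['City'] = [[c.strip() for c in re.findall(city_regex_extract, x)] for x in df['C']]
df['Zip'] = [re.findall(zip_regex_extract, x) for x in df['C']]

# Explode each list column separately, renumber the rows 0..n-1,
# then join the two results on that fresh index
cities = df.explode('City').reset_index()[['No', 'Date', 'City']]
zips = df.explode('Zip').reset_index()[['Zip']]
out = cities.join(zips)
print(out)
```

The join pairs rows purely by position, so this only lines up when every row has the same number of cities and zips, in the same order.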
