I need some help splitting a field with a regex in pandas and building a DataFrame from the result.
| No | Date | C |
|---|---|---|
| 1129 | 19-APR-2021 | Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 City: Anchorage_Alaska , Zip: 99506 , 501thru524 |
| 1139 | 20-APR-2021 | Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 City: Atlanta_Georgia , Zip: 30301 , 301thru381 |
Column C holds several City and Zip details per row; these need to be extracted and normalized into the format below:
| No | Date | City | Zip |
|---|---|---|---|
| 1129 | 19-APR-2021 | Huntsville_Alabama | 35808 |
| 1129 | 19-APR-2021 | Anchorage_Alaska | 99506 |
| 1139 | 20-APR-2021 | Miami_Florida | 33128 |
| 1139 | 20-APR-2021 | Atlanta_Georgia | 30301 |
My re.findall patterns are below, and each works fine on its own:
city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*" (https://regex101.com/r/VM8oFF/1)
zip_regex_extract = r"[0-9]{5}" (https://regex101.com/r/oBYJZX/1)
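As a quick check of the two patterns, here is a minimal sketch that runs them with re.findall on the first sample string. The name `sample` and the `.strip()` step are added for illustration only. Two things are worth noting: the city pattern starts with a literal space, so each match keeps that leading space, and the `|` characters inside the square brackets are matched literally rather than acting as alternation.

```python
import re

# First sample value from column C (copied from the table above).
sample = (
    "Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 "
    "City: Anchorage_Alaska , Zip: 99506 , 501thru524"
)

city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"
zip_regex_extract = r"[0-9]{5}"

# Each city match includes the leading space required by the pattern,
# so strip it before using the value.
cities = [c.strip() for c in re.findall(city_regex_extract, sample)]

# Matches any run of exactly five consecutive digits found while scanning.
zips = re.findall(zip_regex_extract, sample)

print(cities)
print(zips)
```

The zip pattern relies on the range fields (such as 801thru816) never containing five digits in a row; if they could, anchoring on the `Zip:` label would be safer.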
Below is the code so far; however, I am unable to add the Zip field to the same result.
import pandas as pd
import re
df = pd.DataFrame({
'No': ['1129', '1139'],
'Date': ['19-APR-2021','20-APR-2021'],
'C': ['Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 City: Anchorage_Alaska , Zip: 99506 , 501thru524','Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 City: Atlanta_Georgia , Zip: 30301 , 301thru381']
})
city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"
zip_regex_extract = r"[0-9]{5}"
df['City'] = [re.findall(city_regex_extract, str(x)) for x in df['C']]
df['Zip'] = [re.findall(zip_regex_extract, str(x)) for x in df['C']]
df = (df
.set_index(['No','Date'])['City']
.apply(pd.Series)
.stack()
.reset_index()
.drop('level_2', axis=1)
.rename(columns={0:'City'}))
print(df)
I'd appreciate any help.
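One possible way to get City and Zip side by side, offered as a sketch rather than a definitive answer: capture each City/Zip pair with a single pattern and let `Series.str.extractall` produce one row per match. The combined pattern `pair_regex` and the names `pairs` and `result` are assumptions introduced here; the pattern assumes every entry follows the `City: <name> , Zip: <5 digits>` layout shown in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "No": ["1129", "1139"],
    "Date": ["19-APR-2021", "20-APR-2021"],
    "C": [
        "Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 "
        "City: Anchorage_Alaska , Zip: 99506 , 501thru524",
        "Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 "
        "City: Atlanta_Georgia , Zip: 30301 , 301thru381",
    ],
})

# Hypothetical combined pattern: the named groups become column names.
pair_regex = r"City:\s*(?P<City>\w+)\s*,\s*Zip:\s*(?P<Zip>\d{5})"

# extractall returns one row per match, indexed by (original row, match number).
pairs = df["C"].str.extractall(pair_regex)

# Drop the match-number level so the index lines up with df, then join
# the repeated No/Date values onto each extracted pair.
result = (
    df[["No", "Date"]]
    .join(pairs.droplevel("match"))
    .reset_index(drop=True)
)

print(result)
```

Capturing both values in one pattern keeps each Zip tied to its own City, which two independent findall lists cannot guarantee if a row ever has a missing value.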