Pandas: explode a column into multiple rows

Question

I need some help splitting a field with a regex in pandas and building a DataFrame from the result.

A	B	C
1129	19-APR-2021	Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 City: Anchorage_Alaska , Zip: 99506 , 501thru524
1139	20-APR-2021	Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 City: Atlanta_Georgia , Zip: 30301 , 301thru381

Column C holds multiple City and Zip Code entries per row. They need to be extracted and normalized into the format below:

No	Date	City	Zip
1129	19-APR-2021	Huntsville_Alabama	35808
1129	19-APR-2021	Anchorage_Alaska	99506
1139	20-APR-2021	Miami_Florida	33128
1139	20-APR-2021	Atlanta_Georgia	30301

My re.findall expressions are below and work fine:

city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"    (https://regex101.com/r/VM8oFF/1)
zip_regex_extract = r"[0-9]{5}"                            (https://regex101.com/r/oBYJZX/1)

Below is my code so far; however, I am unable to add the Zip field to the result.

import re

import pandas as pd


df = pd.DataFrame({
   'No': ['1129', '1139'],
   'Date': ['19-APR-2021','20-APR-2021'],
   'C': ['Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  City: Anchorage_Alaska , Zip: 99506 , 501thru524','Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190  City: Atlanta_Georgia , Zip: 30301 , 301thru381'] 
})


city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"
zip_regex_extract = r"[0-9]{5}"


df['City'] =  [re.findall(city_regex_extract, str(x)) for x in df['C']]
df['Zip'] =  [re.findall(zip_regex_extract, str(x)) for x in df['C']]

df = (df
.set_index(['No','Date'])['City']
.apply(pd.Series)
.stack()
.reset_index()
.drop('level_2', axis=1)
.rename(columns={0:'City'}))

print(df)

Appreciate any help.

Shubham Sharma · Accepted Answer · 2021-04-30 14:40:48Z

3

`Series.str.extractall`

s = df['C'].str.extractall(r'City:\s*(?P<City>[^,]+?)\s*,\s*Zip:\s*(?P<Zip>\d+)')
df[['No', 'Date']].join(s.droplevel(1))

     No         Date                City    Zip
0  1129  19-APR-2021  Huntsville_Alabama  35808
0  1129  19-APR-2021    Anchorage_Alaska  99506
1  1139  20-APR-2021       Miami_Florida  33128
1  1139  20-APR-2021     Atlanta_Georgia  30301

Regex details:

City: : Matches the characters City: literally
\s* : Matches zero or more whitespace characters
(?P<City>[^,]+?): First named capturing group
- [^,]+?: Matches any character except , one or more times, but as few times as possible
\s*,\s* : Matches zero or more whitespace characters, followed by a comma, followed by zero or more whitespace characters
Zip: : Matches the characters Zip: literally
\s* : Matches zero or more whitespace characters
(?P<Zip>\d+): Second named capturing group
- \d+: Matches a digit one or more times
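
As a minimal self-contained sketch of how the pieces above fit together, the following rebuilds the sample frame from the question and applies the pattern. The variable names `pattern` and `out` are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'No': ['1129', '1139'],
    'Date': ['19-APR-2021', '20-APR-2021'],
    'C': [
        'Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  '
        'City: Anchorage_Alaska , Zip: 99506 , 501thru524',
        'Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190  '
        'City: Atlanta_Georgia , Zip: 30301 , 301thru381',
    ],
})

pattern = r'City:\s*(?P<City>[^,]+?)\s*,\s*Zip:\s*(?P<Zip>\d+)'

# extractall returns one row per match; the result is indexed by
# (original row label, match number) and has one column per named group
s = df['C'].str.extractall(pattern)

# Dropping the match-number level leaves the original row labels, so the
# join repeats No/Date once for every match found in that row.
# join returns a new DataFrame, so the result is assigned to a variable.
out = df[['No', 'Date']].join(s.droplevel(1))
print(out)
```

The extracted values are strings; if numeric ZIP codes are needed, they would have to be converted separately.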


edited Apr 30, 2021 at 14:40

answered Apr 30, 2021 at 14:26


4 Comments

pats4u Over a year ago

Thanks, Shubham. However, I am unable to get the above output. Not sure if I missed something; below is the output:

     No         Date                                                  C
0  1129  19-APR-2021  Zip Code Details: City: Huntsville_Alabama , Z...
1  1139  20-APR-2021  Zip Code Details: City: Miami_Florida , Zip: 3...

Shubham Sharma Over a year ago

How many rows are you getting after using the above approach: 2 or 4?

pats4u Over a year ago

I am able to view only 2 records without any extraction of City or Zip.

Shubham Sharma Over a year ago

df[['No', 'Date']].join(s.droplevel(1)) is not an in-place operation; you have to assign the result to a variable. For example, out = df[['No', 'Date']].join(s.droplevel(1)), and then check the value of out.

Andreas · Accepted Answer · 2021-04-30 15:22:24Z

1

In my opinion you don't even need the re library: pandas string methods accept regular expressions, so you can split on the delimiters directly:

df['C'] = df['C'].str.split(' City: ').str[1:]
df = df.explode('C')
df[['City','Zip']] = df['C'].str.split(' , Zip: | , ', expand=True).iloc[:,:2]

print(df[['No', 'Date', 'City', 'Zip']])

     No         Date                City    Zip
0  1129  19-APR-2021  Huntsville_Alabama  35808
0  1129  19-APR-2021    Anchorage_Alaska  99506
1  1139  20-APR-2021       Miami_Florida  33128
1  1139  20-APR-2021     Atlanta_Georgia  30301

The expand=True parameter returns the split pieces as separate columns. The .iloc[] call selects which of those columns to keep after the split has occurred.
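
To make the role of expand=True and .iloc concrete, here is a small sketch that looks at the intermediate result of the second split. The Series `fragments` is a hand-written stand-in for the exploded column, and the names `parts` and `city_zip` are illustrative.

```python
import pandas as pd

# Hand-written stand-in for the exploded column: one "City , Zip , range"
# fragment per row
fragments = pd.Series([
    'Huntsville_Alabama , Zip: 35808 , 801thru816 ',
    'Anchorage_Alaska , Zip: 99506 , 501thru524',
])

# The separator is a regex alternation: split on ' , Zip: ' or on ' , '.
# With expand=True each piece lands in its own column (labelled 0, 1, 2).
parts = fragments.str.split(' , Zip: | , ', expand=True)
print(parts)

# Keep only the first two columns (city and zip) and give them names
city_zip = parts.iloc[:, :2].rename(columns={0: 'City', 1: 'Zip'})
print(city_zip)
```

The third column holds the trailing range such as 501thru524, which is why only the first two columns are kept.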

edited Apr 30, 2021 at 15:22

answered Apr 30, 2021 at 14:12


2 Comments

pats4u Over a year ago

Thank you, Andreas. However, I am unable to get 4 records; only Anchorage_Alaska and Atlanta_Georgia are retrieved.

Andreas Over a year ago

@pats4u oh, you were right, fixed it. Sorry about the confusion.

Gusti Adli · Accepted Answer · 2021-04-30 14:24:25Z

1

Try .explode() on City and on Zip, each followed by reset_index(), and then join the two exploded results on the index:

df.explode('City').reset_index()[['No', 'Date', 'City']]\
    .join(df.explode('Zip').reset_index()[['Zip']])
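
This approach needs the list-valued City and Zip columns to exist first. The sketch below builds them with Series.str.findall using two simplified patterns (an assumption for illustration, not the question's exact regexes) and then applies the explode-and-join idea. It also shows, assuming pandas 1.3 or newer, that explode can take both columns at once when the lists in each row have equal length.

```python
import pandas as pd

df = pd.DataFrame({
    'No': ['1129', '1139'],
    'Date': ['19-APR-2021', '20-APR-2021'],
    'C': [
        'Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  '
        'City: Anchorage_Alaska , Zip: 99506 , 501thru524',
        'Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190  '
        'City: Atlanta_Georgia , Zip: 30301 , 301thru381',
    ],
})

# Simplified illustrative patterns; each row gets a list of matches
df['City'] = df['C'].str.findall(r'City:\s*(\w+)')
df['Zip'] = df['C'].str.findall(r'Zip:\s*(\d{5})')

# Explode each list column separately, renumber the rows, then join the
# two results positionally on the fresh 0..n-1 index
left = df.explode('City').reset_index(drop=True)[['No', 'Date', 'City']]
right = df.explode('Zip').reset_index(drop=True)[['Zip']]
out = left.join(right)
print(out)

# Assumes pandas >= 1.3: explode both list columns in a single call
alt = df.explode(['City', 'Zip'], ignore_index=True)[['No', 'Date', 'City', 'Zip']]
print(alt)
```

The positional join only lines up correctly when every row has the same number of cities and zips, so the single-call form is the safer choice where it is available, since it raises an error on mismatched lengths.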

answered Apr 30, 2021 at 14:24

