Combine multiple regex expressions in pandas.DataFrame.str.replace?

Question

I've got a column in a dataframe that I want to clean up by removing the bracketed text.

1                          Auburn (Auburn University)[1]
2                 Florence (University of North Alabama)
3        Jacksonville (Jacksonville State University)[2]
4             Livingston (University of West Alabama)[2]
5               Montevallo (University of Montevallo)[2]
6                              Troy (Troy University)[2]
7      Tuscaloosa (University of Alabama, Stillman Co...
8                      Tuskegee (Tuskegee University)[5]
10         Fairbanks (University of Alaska Fairbanks)[2]
12            Flagstaff (Northern Arizona University)[6]

I used unitowns['City'].str.replace(r'\(.*\)', '', regex=True).str.replace(r'\[.*\]', '', regex=True) to get the intended result:

1                            Auburn 
2                          Florence 
3                      Jacksonville 
4                        Livingston 
5                        Montevallo 
6                              Troy 
7                        Tuscaloosa 
8                          Tuskegee 
10                        Fairbanks 
12                        Flagstaff

Is there a way to combine these expressions? This attempt does not seem to work: unitowns['City'].str.replace('(\(.*\)) | (\[.*\])','')

cs95 · Accepted Answer · 2018-01-07 22:50:39Z

5

Option 1
str.extract/str.findall
Rather than removing the irrelevant content, why not extract the relevant part instead?

df.City.str.extract(r'(.*?)(?=\()', expand=False)

Or,

df.City.str.findall(r'(.*?)(?=\()').str[0]

0          Auburn 
1        Florence 
2    Jacksonville 
3      Livingston 
4      Montevallo 
5            Troy 
6      Tuscaloosa 
7        Tuskegee 
8       Fairbanks 
9       Flagstaff 
Name: City, dtype: object

You may also want to get rid of leading/trailing spaces post extraction. You can call str.strip on the result -

df.City = df.City.str.extract(r'(.*?)(?=\()', expand=False).str.strip()

Or,

df.City = df.City.str.findall(r'(.*?)(?=\()').str[0].str.strip()

Regex Details

(      # capture group
.*?    # non-greedy matcher
)
(?=    # lookahead
\(     # opening parenthesis
)
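
The same pattern can be tried with Python's built-in re module to see what the capture group picks up. This is only a minimal sketch: the first two strings are copied from the question, and the third is a hypothetical row without parentheses, added to show that the lookahead makes the match fail when no "(" is present (str.extract would return NaN for such a row).

```python
import re

# Lazy capture up to (but not including) the first "(".
pattern = re.compile(r'(.*?)(?=\()')

samples = [
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alabama[edit]',  # hypothetical row with no parentheses
]

results = []
for s in samples:
    m = pattern.search(s)
    # group(1) is the text before the first "("; None means no "(" was found.
    results.append(m.group(1).strip() if m else None)

print(results)  # ['Auburn', 'Florence', None]
```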

Option 2
str.split
If your city names only consist of one word, str.split would also work.

df.City.str.split(r'\s', n=1).str[0]

0          Auburn
1        Florence
2    Jacksonville
3      Livingston
4      Montevallo
5            Troy
6      Tuscaloosa
7        Tuskegee
8       Fairbanks
9       Flagstaff
Name: City, dtype: object
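
The one-word caveat is easy to hit. Here is a short sketch, assuming a hypothetical multi-word city (the "Fort Collins" row is not in the question's data). Splitting on the first whitespace truncates the name, while splitting on the first opening bracket and stripping the result keeps it intact.

```python
import pandas as pd

# The second row is a hypothetical multi-word city, added for illustration.
city = pd.Series([
    'Auburn (Auburn University)[1]',
    'Fort Collins (Colorado State University)',
])

# Split on the first whitespace: only safe for one-word names.
first_word = city.str.split(r'\s', n=1).str[0]

# Split on the first "(" or "[" instead, then trim the trailing space.
before_bracket = city.str.split(r'[(\[]', n=1).str[0].str.strip()

print(first_word.tolist())      # ['Auburn', 'Fort']
print(before_bracket.tolist())  # ['Auburn', 'Fort Collins']
```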

Option 3
str.replace
Condensing your chained calls, you can use -

df['City'].str.replace(r'\(.*?\)|\[.*?\]', '', regex=True).str.strip()

0          Auburn
1        Florence
2    Jacksonville
3      Livingston
4      Montevallo
5            Troy
6      Tuscaloosa
7        Tuskegee
8       Fairbanks
9       Flagstaff
Name: City, dtype: object
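
As for why the combined attempt in the question fails on the rows shown: whitespace inside a regex is literal unless the verbose flag is set, so '(\(.*\)) | (\[.*\])' only matches a parenthesised group followed by a space, or a bracketed group preceded by one. The sketch below uses Python's re module, whose pattern syntax str.replace follows, on one row from the question.

```python
import re

s = 'Auburn (Auburn University)[1]'

# Pattern from the question: the spaces around "|" must match literally,
# so neither alternative matches and the string comes back unchanged.
with_spaces = re.sub(r'(\(.*\)) | (\[.*\])', '', s)

# Same alternation without the spaces, using lazy quantifiers.
combined = re.sub(r'\(.*?\)|\[.*?\]', '', s).strip()

print(with_spaces)  # Auburn (Auburn University)[1]
print(combined)     # Auburn
```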

edited Jan 7, 2018 at 22:50

answered Jan 7, 2018 at 22:34

2 Comments

Pranshu Over a year ago

Hey thanks! This helps. Although, the reason I want to combine regex is I've got some data that are multiple words with numbers at the beginning and brackets in the middle. Would be easier to format what to remove!

Pranshu Over a year ago

@COLDSPEED Awesome! The non-greedy matcher was the key then!! Thanks. P.S. I ended up using the extract for this one! Thanks a lot!!
