I have a column in a DataFrame that I want to clean up by removing the bracketed text.

1                          Auburn (Auburn University)[1]
2                 Florence (University of North Alabama)
3        Jacksonville (Jacksonville State University)[2]
4             Livingston (University of West Alabama)[2]
5               Montevallo (University of Montevallo)[2]
6                              Troy (Troy University)[2]
7      Tuscaloosa (University of Alabama, Stillman Co...
8                      Tuskegee (Tuskegee University)[5]
10         Fairbanks (University of Alaska Fairbanks)[2]
12            Flagstaff (Northern Arizona University)[6]

I used unitowns['City'].str.replace('\(.*\)','').str.replace('\[.*\]','') to get the intended result:

1                            Auburn 
2                          Florence 
3                      Jacksonville 
4                        Livingston 
5                        Montevallo 
6                              Troy 
7                        Tuscaloosa 
8                          Tuskegee 
10                        Fairbanks 
12                        Flagstaff

Is there a way to combine these two expressions into one? This attempt does not work: unitowns['City'].str.replace('(\(.*\)) | (\[.*\])','')

1 Answer

Option 1
str.extract/str.findall
Rather than removing the irrelevant content, why not extract the relevant part instead?

df.City.str.extract(r'(.*?)(?=\()', expand=False)

Or,

df.City.str.findall(r'(.*?)(?=\()').str[0]

0          Auburn 
1        Florence 
2    Jacksonville 
3      Livingston 
4      Montevallo 
5            Troy 
6      Tuscaloosa 
7        Tuskegee 
8       Fairbanks 
9       Flagstaff 
Name: City, dtype: object

The extracted values keep the space that preceded the parenthesis, so you may also want to strip leading/trailing whitespace by calling str.strip on the result:

df.City = df.City.str.extract(r'(.*?)(?=\()', expand=False).str.strip()

Or,

df.City = df.City.str.findall(r'(.*?)(?=\()').str[0].str.strip()

Regex Details

(      # capture group
.*?    # non-greedy matcher
)
(?=    # lookahead
\(     # opening parenthesis
)
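
To see what the capture group and lookahead do on a single value, here is a minimal sketch that uses Python's built-in re module instead of pandas. The helper name city_part is made up for illustration, and 'Juneau' is a hypothetical value with no parenthesis; the other strings come from the question.

```python
import re

# Same pattern as above: lazily capture everything up to, but not
# including, the first opening parenthesis.
PATTERN = re.compile(r'(.*?)(?=\()')

def city_part(value):
    """Return the text before the first '(' or None if there is no '('."""
    match = PATTERN.search(value)
    return match.group(1) if match else None

with_suffix = city_part('Auburn (Auburn University)[1]')
no_suffix = city_part('Florence (University of North Alabama)')
no_paren = city_part('Juneau')  # hypothetical value, not from the question

print(repr(with_suffix), repr(no_suffix), repr(no_paren))
```

The captured text still ends with a space, which is why the str.strip call is useful. A value without any '(' produces no match at all; with str.extract that row comes back as NaN, so this approach assumes every row carries a parenthesised suffix.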

Option 2
str.split
If your city names consist of a single word, str.split also works (in pandas 2.0 and later, n must be passed as a keyword).

df.City.str.split(r'\s', n=1).str[0]

0          Auburn
1        Florence
2    Jacksonville
3      Livingston
4      Montevallo
5            Troy
6      Tuscaloosa
7        Tuskegee
8       Fairbanks
9       Flagstaff
Name: City, dtype: object
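
The single-word caveat is easy to demonstrate. The sketch below uses plain Python strings rather than a Series; the function names are illustrative, and the 'College Station' row is a made-up example, not part of the question's data.

```python
import re

def first_word(value):
    # Mirrors Option 2: split once on the first whitespace character.
    return re.split(r'\s', value, maxsplit=1)[0]

def before_paren(value):
    # Alternative: split once on the first '(' and trim the trailing space.
    return value.split('(', 1)[0].strip()

single = 'Troy (Troy University)[2]'
multi = 'College Station (Texas A&M University)[3]'  # hypothetical row

print(first_word(single))    # single-word name survives
print(first_word(multi))     # multi-word name is cut short
print(before_paren(multi))   # splitting on '(' keeps the full name
```

If multi-word names are possible, splitting on '(' instead of whitespace is probably the safer variant; on a Series that would look something like df.City.str.split('(', n=1).str[0].str.strip().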

Option 3
str.replace
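You can condense your chained calls into a single pattern. Note the non-greedy quantifiers and the absence of spaces around |. In pandas 2.0 and later you also need regex=True, because str.replace treats the pattern as a literal string by default:

df['City'].str.replace(r'\(.*?\)|\[.*?\]', '', regex=True).str.strip()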

0          Auburn
1        Florence
2    Jacksonville
3      Livingston
4      Montevallo
5            Troy
6      Tuscaloosa
7        Tuskegee
8       Fairbanks
9       Flagstaff
Name: City, dtype: object
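
As for why the combined pattern in the question matches nothing: the spaces around | are part of the regex, so each alternative requires a literal space that the data does not have in that position. The sketch below checks this with re.sub on one value from the question. The 'middle' string is a hypothetical value, included only to show how greedy and non-greedy quantifiers differ when brackets sit in the middle of the text.

```python
import re

value = 'Auburn (Auburn University)[1]'

# Pattern from the question: the spaces around '|' are literal, so the
# first alternative needs ') ' and the second needs ' [' in the data.
with_spaces = r'(\(.*\)) | (\[.*\])'
# Pattern from Option 3: no stray spaces, non-greedy quantifiers.
combined = r'\(.*?\)|\[.*?\]'

unchanged = re.sub(with_spaces, '', value)
cleaned = re.sub(combined, '', value).strip()

# Greedy vs non-greedy when bracketed text appears mid-string.
middle = '12 Main (north)[3] Street (annex)'  # hypothetical value
greedy = re.sub(r'\(.*\)|\[.*\]', '', middle)
lazy = re.sub(combined, '', middle)

print(repr(unchanged), repr(cleaned))
print(repr(greedy), repr(lazy))
```

The greedy version swallows everything from the first '(' to the last ')', including 'Street', while the non-greedy version removes each bracketed group separately and leaves the text between them in place.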

2 Comments

Hey, thanks! This helps. The reason I want to combine the regexes, though, is that some of my data has multiple words, with numbers at the beginning and brackets in the middle. It would be easier to specify what to remove!
@COLDSPEED Awesome! The non-greedy matcher was the key then!! P.S. I ended up using extract for this one. Thanks a lot!!
