Python pandas regular expression replace part of the matching pattern

Question

I've got a bunch of addresses like so:

df['street'] =
    5311 Whitsett Ave 34
    355 Sawyer St
    607 Hampshire Rd #358
    342 Old Hwy 1
    267 W Juniper Dr 402

I want to remove the numbers at the end of the street part of each address to get:

df['street'] =
    5311 Whitsett Ave
    355 Sawyer St
    607 Hampshire Rd
    342 Old Hwy 1
    267 W Juniper Dr

I have my regular expression like this:

df['street'] = df.street.str.replace(r"""\s(?:dr|ave|rd)[^a-zA-Z]\D*\d+$""", '', case=False, regex=True)

which gives me this:

df['street'] =
    5311 Whitsett
    355 Sawyer St
    607 Hampshire
    342 Old Hwy 1
    267 W Juniper

It dropped the words 'Ave', 'Rd' and 'Dr' from my original street addresses. Is there a way to keep part of the regular expression pattern (in my case 'Ave', 'Rd', 'Dr') and replace the rest?

EDIT: Notice the address 342 Old Hwy 1. I do not want to take out the number in such cases. That's why I specified the patterns ('Ave', 'Rd', 'Dr', etc.) to have better control over what gets changed.

@AvinashRaj Sorry, I don't understand the suggestion you made. Can you please elaborate?
– breezymri, Commented Oct 16, 2015 at 16:45
In plain Python, I would use re.sub(regex, replace, string).
– Avinash Raj, Commented Oct 16, 2015 at 16:52
Sorry, what I don't get is how the pattern you suggested handles my situation. I get that \s* matches zero or more whitespace characters, I'm not sure what "#?" means, and \d+$ is my ending condition.
– breezymri, Commented Oct 16, 2015 at 17:04

LetzerWille · Accepted Answer · 2015-10-17 00:14:21Z

1

    import re

    df_street = '''
        5311 Whitsett Ave 34
        355 Sawyer St
        607 Hampshire Rd #358
        342 Old Hwy 1
        267 W Juniper Dr 402
        '''
    # Match a street suffix (Ave, Rd or Dr), whitespace, an optional '#',
    # the trailing digits, optional whitespace, and the newline.
    # Group 1 captures the suffix, so \1 in the replacement puts it back.
    # flags must be passed by keyword: the fourth positional argument of re.sub is count.
    df_street = re.sub(r'(Ave|Rd|Dr)\s+#?\d+\s*\n', r'\1\n', df_street,
                       flags=re.MULTILINE | re.IGNORECASE)
    print(df_street)

    5311 Whitsett Ave
    355 Sawyer St
    607 Hampshire Rd
    342 Old Hwy 1
    267 W Juniper Dr
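The same capture-group idea can be applied directly to the pandas column from the question. The following is a minimal sketch, not part of the original answer: it assumes a DataFrame `df` with a `street` column holding the sample addresses, and a reasonably recent pandas where `Series.str.replace` needs `regex=True` to treat the pattern as a regular expression. The `\b` word boundary and the variable names `pattern` and `cleaned` are additions for illustration.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "street": [
        "5311 Whitsett Ave 34",
        "355 Sawyer St",
        "607 Hampshire Rd #358",
        "342 Old Hwy 1",
        "267 W Juniper Dr 402",
    ]
})

# Group 1 captures the street suffix so that \1 can write it back.
# \b stops "rd" or "dr" from matching at the end of a longer word.
pattern = r"\b(dr|ave|rd)\s+#?\d+\s*$"

df["street"] = df["street"].str.replace(
    pattern, r"\1", flags=re.IGNORECASE, regex=True
)

cleaned = df["street"].tolist()
print(cleaned)
```

Because the match is only attempted after one of the listed suffixes, `342 Old Hwy 1` is left alone; adding further suffixes means extending the alternation in group 1.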

edited Oct 17, 2015 at 0:14

answered Oct 16, 2015 at 17:45

2 Comments

breezymri Over a year ago

Your solution does not keep the 'Ave', 'Rd', or 'Dr' either. I want to keep them.

breezymri Over a year ago

That's perfect. \1 is what I was looking for!

Mayur Koshti · Answer · 2015-10-16 17:05:17Z

0

You should use the following regex:

>>> import re
>>> example_str = "607 Hampshire Rd #358"
>>> re.sub(r"\s*\#?[^\D]+\s*$", r"", example_str)
'607 Hampshire Rd'
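Note that this pattern removes any trailing number, so it would also shorten `342 Old Hwy 1`, which the question's EDIT asks to keep. The sketch below compares it with a variant restricted to the listed suffixes. The lookbehind form is an illustrative alternative to a capture group and is not taken from either answer; the names `broad`, `narrow`, `broad_results` and `narrow_results` are hypothetical. Python's `re` requires fixed-width lookbehinds, hence one lookbehind per suffix.

```python
import re

streets = [
    "5311 Whitsett Ave 34",
    "355 Sawyer St",
    "607 Hampshire Rd #358",
    "342 Old Hwy 1",
    "267 W Juniper Dr 402",
]

# Pattern from the answer above: strips any trailing number.
broad = re.compile(r"\s*\#?[^\D]+\s*$")

# Variant that matches only when the number follows Ave, Rd or Dr.
# Each lookbehind has a fixed width, as Python's re module requires.
narrow = re.compile(
    r"(?:(?<=\bave)|(?<=\brd)|(?<=\bdr))\s+#?\d+\s*$",
    re.IGNORECASE,
)

broad_results = [broad.sub("", s) for s in streets]
narrow_results = [narrow.sub("", s) for s in streets]

for original, b, n in zip(streets, broad_results, narrow_results):
    print(f"{original!r:26} broad={b!r:22} narrow={n!r}")
```

Since the lookbehind consumes no characters, the suffix is never part of the match and no backreference is needed in the replacement.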

answered Oct 16, 2015 at 17:05
