I've got a bunch of addresses like so:

df['street'] =
    5311 Whitsett Ave 34
    355 Sawyer St
    607 Hampshire Rd #358
    342 Old Hwy 1
    267 W Juniper Dr 402

I want to remove the numbers that trail the street part of each address, to get:

df['street'] =
    5311 Whitsett Ave
    355 Sawyer St
    607 Hampshire Rd
    342 Old Hwy 1
    267 W Juniper Dr

I have my regular expression like this:

df['street'] = df.street.str.replace(r"""\s(?:dr|ave|rd)[^a-zA-Z]\D*\d+$""", '', case=False)

which gives me this:

df['street'] =
    5311 Whitsett
    355 Sawyer St
    607 Hampshire
    342 Old Hwy 1
    267 W Juniper

It dropped the words 'Ave', 'Rd' and 'Dr' from my original street addresses. Is there a way to keep part of the matched pattern (in my case 'Ave', 'Rd', 'Dr') and replace only the rest?

EDIT: Notice the address 342 Old Hwy 1. I do not want to remove the number in such cases. That's why I specified the suffixes ('Ave', 'Rd', 'Dr', etc.) in the pattern: to have better control over which addresses get changed.

5
  • just use this r"\s*#?\d+$" regex Commented Oct 16, 2015 at 16:39
  • @AvinashRaj Sorry, I don't understand the suggestion you made. Can you please elaborate? Commented Oct 16, 2015 at 16:45
  • try df.street.str.replace(r"\s*#?\d+$", '', case=False) Commented Oct 16, 2015 at 16:51
  • In plain Python, you would use re.sub(regex, replace, string) Commented Oct 16, 2015 at 16:52
  • Sorry what I don't get is how the pattern you suggested does the job for my situation. I get that \s* matches 0 or more spaces, not sure what "#?" means, then \d+$ is my ending condition. Commented Oct 16, 2015 at 17:04
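
To make the suggestion in the comments concrete, here is a minimal sketch using only the standard `re` module (the list `streets` and the name `TRAILING_NUMBER` are illustrative, not from the thread). It reads the pattern token by token and applies it to the five sample addresses. Note that this pattern is not tied to a street suffix, so it also strips the `1` from `342 Old Hwy 1`, which the question wants to keep.

```python
import re

# Pattern from the comment above, read token by token:
#   \s*  zero or more whitespace characters
#   #?   an optional literal '#'
#   \d+  one or more digits
#   $    end of the string
TRAILING_NUMBER = re.compile(r"\s*#?\d+$")

streets = [
    "5311 Whitsett Ave 34",
    "355 Sawyer St",
    "607 Hampshire Rd #358",
    "342 Old Hwy 1",
    "267 W Juniper Dr 402",
]

# Remove any trailing number, whatever precedes it.
cleaned = [TRAILING_NUMBER.sub("", s) for s in streets]

for before, after in zip(streets, cleaned):
    print(f"{before!r} -> {after!r}")
```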

2 Answers

1
    import re

    df_street = '''
        5311 Whitsett Ave 34
        355 Sawyer St
        607 Hampshire Rd #358
        342 Old Hwy 1
        267 W Juniper Dr 402
        '''
    # The trailing digits follow one of (Ave, Rd, Dr) and whitespace, may be
    # prefixed by a '#', and may be followed by spaces before the newline.
    # \1 re-inserts the captured suffix. The flags must be passed by keyword:
    # the fourth positional argument of re.sub is count, not flags.
    df_street = re.sub(r'(Ave|Rd|Dr)\s+#?\d+\s*\n', r'\1\n', df_street,
                       flags=re.MULTILINE | re.IGNORECASE)
    print(df_street)

    5311 Whitsett Ave
    355 Sawyer St
    607 Hampshire Rd
    342 Old Hwy 1
    267 W Juniper Dr
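
The answer above works on one multi-line string, whereas the question has one address per row. The following is a rough per-string sketch of the same capture-group idea; `strip_unit_number` and `SUFFIX_THEN_UNIT` are hypothetical names, and the suffix list is assumed to be just the three named in the question.

```python
import re

# Hypothetical helper, not from the original answer. The suffix list is an
# assumption: only the three suffixes named in the question.
# \b keeps the suffix from matching in the middle of a longer word.
SUFFIX_THEN_UNIT = re.compile(r"\b(ave|rd|dr)\s+#?\d+\s*$", re.IGNORECASE)


def strip_unit_number(street: str) -> str:
    """Drop a trailing unit number only when it follows Ave, Rd or Dr."""
    # \1 puts the captured suffix back, preserving its original casing.
    return SUFFIX_THEN_UNIT.sub(r"\1", street)


streets = [
    "5311 Whitsett Ave 34",
    "355 Sawyer St",
    "607 Hampshire Rd #358",
    "342 Old Hwy 1",
    "267 W Juniper Dr 402",
]

cleaned = [strip_unit_number(s) for s in streets]
print(cleaned)
```

With pandas, the same pattern string and the `r"\1"` replacement should carry over to `df['street'].str.replace(...)` with `flags=re.IGNORECASE`. Recent pandas versions treat the pattern as a literal string unless `regex=True` is passed, so that argument is likely needed as well.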

2 Comments

Your solution does not keep the 'Ave', 'Rd', or 'Dr' either. I want to keep them.
That's perfect. \1 is what I was looking for!
0

You should use the following regex:

>>> import re
>>> example_str = "607 Hampshire Rd #358"
>>> re.sub(r"\s*\#?[^\D]+\s*$", r"", example_str)
'607 Hampshire Rd'
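
A short check of how this pattern behaves, as a sketch with illustrative sample strings: the class `[^\D]` ("not a non-digit") matches the same characters as `\d`, so the pattern can be written more simply. Like any pattern that is not tied to a street suffix, it also removes the `1` from `342 Old Hwy 1`.

```python
import re

original = r"\s*\#?[^\D]+\s*$"   # pattern from the answer
simplified = r"\s*#?\d+\s*$"     # same meaning: [^\D] is equivalent to \d

samples = [
    "607 Hampshire Rd #358",
    "267 W Juniper Dr 402  ",   # trailing spaces are consumed by \s*
    "355 Sawyer St",            # no trailing number, left unchanged
    "342 Old Hwy 1",            # the highway number is stripped too
]

# Each pair holds (result with original pattern, result with simplified one).
results = [(re.sub(original, "", s), re.sub(simplified, "", s)) for s in samples]

for s, (a, b) in zip(samples, results):
    print(f"{s!r} -> {a!r} (simplified pattern agrees: {a == b})")
```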
