I'm working with a dataframe that contains addresses, and I want to delete a specific part of each string. For example, consider a dataset of addresses.

I want to delete everything from the word "REFERENCE:" or "reference:" to the end of the string, and store the result in a new column (without the word REFERENCE:/reference: and the text that follows it). Could you help me do this with regex? I want the new column to look something like this: edit_column

You should post your code and its output as text so we can easily work with them. Commented Sep 23, 2020 at 1:30

2 Answers

You can use some regex to obtain the desired results.

import pandas as pd

df = pd.DataFrame({"address": ["Street Pases de la Reforma #200 REFERENCE: Green house", "Street Carranza #300 12 & 13 REFERENCE: There is a tree"]})

# Lazily match everything before the word REFERENCE in each address
df.address.str.findall(r".+?(?=REFERENCE)").explode()

0    Street Pases de la Reforma #200 
1       Street Carranza #300 12 & 13

Explanation of the regex pattern:

. matches any character (except for line terminators)
+? Quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?=REFERENCE) Positive lookahead: asserts that REFERENCE comes next, without including it in the match
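
To see the lazy quantifier and the lookahead at work outside pandas, here is a minimal sketch using Python's re module. The first two sample strings come from the answer above (the second one lowercased to mirror the question), the third is made up for illustration, and the helper name before_reference is hypothetical. The re.IGNORECASE flag and the rstrip() call are additions, not part of the original pattern.

```python
import re

# Lazy match up to (but not including) the word REFERENCE.
# re.IGNORECASE is an addition so that "reference:" is handled as well.
PATTERN = re.compile(r".+?(?=REFERENCE)", re.IGNORECASE)

def before_reference(text):
    """Return the text before REFERENCE, or the text unchanged if there is no match."""
    match = PATTERN.match(text)
    # rstrip() drops the space that precedes the keyword.
    return match.group(0).rstrip() if match else text

addresses = [
    "Street Pases de la Reforma #200 REFERENCE: Green house",
    "Street Carranza #300 12 & 13 reference: There is a tree",
    "Street Without Notes #1",  # made-up row with no reference part
]
results = [before_reference(a) for a in addresses]
print(results)
```

The fallback matters because the pattern finds nothing when a row has no reference part, so such rows would otherwise come back empty.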

The regex should look like this:

import re

# Case-insensitive, so both "REFERENCE:" and "reference:" are matched
discard_re = re.compile(r'reference:.*', re.IGNORECASE)

Then you can add the new column (this assumes the source column is named addresses):

df['address_new'] = df['addresses'].map(lambda x: discard_re.sub('', x))
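
As a quick self-contained check of the substitution approach, the sketch below applies it to plain strings. The first two sample addresses are borrowed from the other answer (one lowercased), the third is made up, and strip_reference is a hypothetical helper name. The leading \s* is an addition that also removes the whitespace left in front of the keyword.

```python
import re

# Case-insensitive: matches "REFERENCE:", "reference:", "Reference:", ...
# The leading \s* (an addition) also drops whitespace before the keyword.
discard_re = re.compile(r"\s*reference:.*", re.IGNORECASE)

def strip_reference(text):
    """Remove 'reference:' and everything after it on the same line."""
    return discard_re.sub("", text)

samples = [
    "Street Pases de la Reforma #200 REFERENCE: Green house",
    "Street Carranza #300 12 & 13 reference: There is a tree",
    "Street Without Notes #1",  # made-up row with no reference part
]
cleaned = [strip_reference(s) for s in samples]
print(cleaned)
```

With pandas, the same function could be passed to Series.map, for example df['addresses'].map(strip_reference), assuming every value in the column is a string. Missing values would need to be handled first.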
