2

Problem

How to replace X with _, given the following dataframe:

import pandas as pd

data = {'street':['13XX First St', '2XXX First St', '47X Second Ave'], 
        'city':['Ashland', 'Springfield', 'Ashland']} 
df = pd.DataFrame(data) 

The streets need to be edited, replacing each X with an underscore _.

Notice that the number of digits varies, as does the number of Xs. Also, street names such as Xerxes should not be edited to _er_es, but left unchanged. Only the street number section should change.

Desired Output

data = {'street':['13__ First St', '2___ First St', '47_ Second Ave'], 
        'city':['Ashland', 'Springfield', 'Ashland']} 
df = pd.DataFrame(data) 

Progress

Some potential regex building blocks include:
1. [0-9]+ to capture numbers
2. X+ to capture Xs
3. ([0-9]+)(X+) to capture groups

df['street'].replace("([0-9]+)(X+)", value=r"\2", regex=True, inplace=False)

I'm pretty weak with regex, so my approach may not be the best. Preemptive thank you for any guidance or solutions!

2
  • So you want to replace each X with _, using as many underscores as there are Xs? That is, if it was 13XXX, you want 13___ (three underscores)? Commented Jan 9, 2020 at 16:48
  • @Datanovice exactly so, 2 Xs should be replaced by 2 _. X -> _, XX -> __, XXX -> ___. Commented Jan 13, 2020 at 19:47

3 Answers

3

IIUC, this would do:

def repl(m):
    return m.group(1) + '_'*len(m.group(2))

df['street'].str.replace("^([0-9]+)(X*)", repl, regex=True)

Output:

0     13__ First St
1     2___ First St
2    47_ Second Ave
Name: street, dtype: object
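
As a quick check of the street-name requirement from the question, here is a minimal sketch that runs the same anchored pattern on a few extra rows. The rows containing Xerxes and Xmas are hypothetical test values, not part of the original data.

```python
import pandas as pd

def repl(m):
    # Keep the leading digits and swap each trailing X for an underscore.
    return m.group(1) + "_" * len(m.group(2))

# Hypothetical rows: street names that start with X must stay untouched.
streets = pd.Series(["13XX First St", "47X Xerxes Ave", "123 Xmas Street"])

# The ^ anchor limits the match to the street number at the start of the string.
result = streets.str.replace(r"^([0-9]+)(X*)", repl, regex=True)
print(result.tolist())
```

Because the pattern is anchored at the start and requires at least one digit, an X further along in the string is never part of the match.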

5 Comments

I couldn't get a function to work in df.replace. Do you know why? It replaces the entire string with <function repl at 0x000001C242C68268>
You need .str.replace, which accepts a function, not replace.
That's right, but if you wanted to make the change across the entire dataframe, you would need to loop through every column to use str.replace, right?
Yes, or df.apply(lambda x: x.str.replace(...))
This is correct, you need str.replace to run this. It won't work with just df.replace. Good workaround.
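
The df.apply suggestion from the comments could look roughly like the sketch below. It assumes every column holds strings, since the .str accessor fails on numeric columns; the name masked is only illustrative.

```python
import pandas as pd

def repl(m):
    # Keep the leading digits and swap each trailing X for an underscore.
    return m.group(1) + "_" * len(m.group(2))

df = pd.DataFrame({
    "street": ["13XX First St", "2XXX First St", "47X Second Ave"],
    "city": ["Ashland", "Springfield", "Ashland"],
})

# Assumption: all columns are string columns, so .str is available on each.
masked = df.apply(lambda col: col.str.replace(r"^([0-9]+)(X*)", repl, regex=True))
print(masked)
```

Columns whose values do not start with digits, such as city here, pass through unchanged because the pattern never matches them.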
2

IIUC, we can pass a function into the repl argument much like re.sub

def repl(m):
    return '_' * len(m.group())

df['street'].str.replace(r'([X])+', repl, regex=True)

out:

0     13__ First St
1     2___ First St
2    47_ Second Ave
Name: street, dtype: object

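If you need to match only after numbers, we can add a lookbehind '(?<=\d)', so that a run of X is matched only when it directly follows a digit. The digit itself is not part of the match, so repl does not overwrite it.

df['street'].str.replace(r'(?<=\d)X+', repl, regex=True)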
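
One thing to watch when restricting the match to Xs that follow a digit: if the digit is part of the match, a length-based repl overwrites the digit as well. The sketch below uses plain re.sub on a single string to compare a consuming prefix with a lookbehind; the variable names are illustrative only.

```python
import re

def repl(m):
    # One underscore per matched character.
    return "_" * len(m.group())

street = "13XX First St"

# Here the digit is consumed by the match, so it is replaced too.
consumed = re.sub(r"\d(X+)", repl, street)

# Here the digit is only asserted, so just the Xs are replaced.
lookbehind = re.sub(r"(?<=\d)X+", repl, street)

# A street name starting with X is left alone by the lookbehind version.
name_case = re.sub(r"(?<=\d)X+", repl, "123 Xmas Street")

print(consumed)
print(lookbehind)
print(name_case)
```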


0

Assuming 'X' only occurs in the 'street' column

streetresult = df['street'].apply(lambda s: re.sub('X', '_', s))

Your desired output should be the result

Code I tested

import pandas as pd
import re

data = {'street':['13XX First St', '2XXX First St', '47X Second Ave'], 
        'city':['Ashland', 'Springfield', 'Ashland']} 
df = pd.DataFrame(data) 
# Apply the substitution to each street string, not to the printed form of the Series
streetresult = df['street'].apply(lambda s: re.sub('X', '_', s))
print(streetresult)

3 Comments

This will replace X in 123 Xmas Street as well.
This is correct. Setting the regex to match an X only if it follows a \d (numeric value) or another 'X' should account for street names such as that, if I'm not mistaken.
@SublimizeD sorry, I hadn't made that clarification in the problem, but Quang's correct in pointing out that requirement. I'll edit the problem. Thank you!
