Replace function not working in data frame

Question

I have successfully extracted the two sheets of data and appended them, but I want to clean the phone number field. The replace line does not raise an error, but it also does nothing.

Is there another method I should use to clean the phone number string?

import pandas as pd
import xlwings as xw

filename = 'file.xlsx'
wb = xw.Book(filename)
sheet1 = wb.sheets['sheet1']
df1 = sheet1.used_range.options(pd.DataFrame, index=False, header=True).value
sheet2 = wb.sheets['sheet2']
df2 = sheet2.used_range.options(pd.DataFrame, index=False, header=True).value
wb.close()
lists_combined = pd.concat([df1, df2])
lists_combined['filename'] = filename

lists_combined['CustomerVoicePhone'] = lists_combined['CustomerVoicePhone'].replace('-','').replace('(','').replace(')','').replace('+','').replace(' ','')

lists_combined = lists_combined.filter(items=['filename','CustomerEmail', 'CustomerVoicePhone','CustomerTextPhone'])

I'm not too sure about the pandas replace function and whether it can be used like that, but as an alternative, you could try using a regex with the pandas apply() function.
– Vedank Pande, Commented Apr 12, 2021 at 19:31

Guillaume Ansanay-Alex · Accepted Answer · 2021-04-13 16:39:41Z

1

You can map a lambda function over every row that walks through each character of the value and keeps only the digits:

lists_combined['CustomerVoicePhone'] = (lists_combined.CustomerVoicePhone
                                                      .map(lambda x: ''.join(filter(str.isdigit, x))))
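A minimal, self-contained sketch of this approach on made-up sample numbers follows. The helper name `digits_only` and the sample values are illustrative and not from the question. The `isinstance` guard is an added assumption so that cells which are not strings (for example NaN) are passed through instead of raising an error.

```python
import pandas as pd

def digits_only(value):
    """Keep only the digit characters of a string; pass other values through."""
    if not isinstance(value, str):
        return value  # e.g. NaN or None: left untouched in this sketch
    return ''.join(filter(str.isdigit, value))

# Invented sample data covering the characters mentioned in the question.
phones = pd.Series(['(555) 123-4567', '+1 555-123-4567', '5551234567'])
cleaned = phones.map(digits_only)
print(cleaned.tolist())
# ['5551234567', '15551234567', '5551234567']
```

Note that this keeps every digit, including a leading country code, so a later slice such as `.str[-10:]` may still be needed.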

In terms of performance, the following code compares it with the regex approach from the other answer. On a large DataFrame (100k phone numbers) it was slightly faster in a single run, 151 ms versus 166 ms:

import random

import pandas as pd

def gen_phone():
    # Build a random phone number in the form NNN-NNN-NNNN.
    first = str(random.randint(100, 999))
    second = str(random.randint(1, 888)).zfill(3)
    last = str(random.randint(1, 9998)).zfill(4)
    while last in ['1111', '2222', '3333', '4444', '5555', '6666', '7777', '8888']:
        last = str(random.randint(1, 9998)).zfill(4)
    return '{}-{}-{}'.format(first, second, last)

# Build the frame in one step; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame({'p': [gen_phone() for _ in range(100000)]})

def method1():
    regex = r'\)|\(|-|\+|\s'  # or regex = r'[()+\-\s]' using a character class
    # regex=True is required from pandas 2.0 on, where the default is a literal match.
    df['p_1'] = (df['p'].str.replace(regex, '', regex=True)
                        .fillna(df['p']))

%time method1()
# Wall time: 166 ms

def method2():
    df['p_2'] = (df.p.map(lambda x: ''.join(filter(str.isdigit, x))))

%time method2()
# Wall time: 151 ms
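`%time` only works in IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The sample size, the `make_phone` helper and the function names below are assumptions for illustration. Timings depend on the machine and the pandas version, so no figures are quoted here.

```python
import random
import timeit

import pandas as pd

random.seed(0)  # reproducible sample data

def make_phone():
    """Hypothetical helper: a random phone number in 'NNN-NNN-NNNN' form."""
    return '{}-{:03d}-{:04d}'.format(
        random.randint(100, 999), random.randint(0, 999), random.randint(0, 9999)
    )

sample = pd.DataFrame({'p': [make_phone() for _ in range(10_000)]})

def with_regex():
    # Regex approach: strip brackets, plus signs, hyphens and whitespace.
    return sample['p'].str.replace(r'[()+\-\s]', '', regex=True)

def with_filter():
    # Filter approach: keep only digit characters.
    return sample['p'].map(lambda x: ''.join(filter(str.isdigit, x)))

# Both approaches should give identical results on this data.
assert with_regex().tolist() == with_filter().tolist()

for func in (with_regex, with_filter):
    seconds = timeit.timeit(func, number=10)
    print(f'{func.__name__}: {seconds / 10 * 1000:.1f} ms per run')
```

Averaging over several runs, as `timeit` does here, gives a steadier picture than a single wall-time measurement.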

edited Apr 13, 2021 at 16:39

answered Apr 12, 2021 at 19:47

Guillaume Ansanay-Alex


3 Comments

user8408080 Over a year ago

Code-only answers are not suitable for SO; you need to explain what you are doing here.

Guillaume Ansanay-Alex Over a year ago

@BenDegroot I have explained and benchmarked this answer, does it solve your problem?

user8408080 Over a year ago

I didn't ask the original question; I came here to moderate, because someone flagged your answer as very low quality. Now it is much better! +1 from me

Scott Boston · Accepted Answer · 2021-04-12 20:05:28Z

0

Let's use the .str accessor with replace and a regex:

regex = r'\)|\(|-|\+|\s'  # or regex = r'[()+\-\s]' using a character class
# regex=True is required from pandas 2.0 on, where the default is a literal match.
lists_combined['CustomerVoicePhone'] = (lists_combined['CustomerVoicePhone'].str.replace(regex, '', regex=True)
                                 .fillna(lists_combined['CustomerVoicePhone']))

edited Apr 12, 2021 at 20:05

answered Apr 12, 2021 at 19:48

Scott Boston


8 Comments

Ben Degroot Over a year ago

Thanks for the help! That worked for all the lines that had characters but returned blank rows for the phone numbers that were already clean. Any ideas on how to modify?

Scott Boston Over a year ago

Add .fillna(lists_combined['CustomerVoicePhone']) at the end.

Scott Boston Over a year ago

@BenDegroot did this help?

Ben Degroot Over a year ago

Yes it did, except for one more small issue. The lines that were already clean are not giving me the right 10 characters. I tried adding another .str[-10:] after the .fillna but that nulled it.
lists_combined['CustomerVoicePhone'] = lists_combined['CustomerVoicePhone'].str.replace(regex,'').str.replace(' ','').str[-10:].fillna(lists_combined['CustomerVoicePhone'])

Ben Degroot Over a year ago

I figured out the issue. I stepped through the format that xlwings was grabbing from the Excel file and found it was classifying some lines as floats, adding a .0. I normalized the df and then removed the '.0'. That did the trick! Thanks so much!
lists_combined = lists_combined.astype(str)
lists_combined['CustomerVoicePhone'] = lists_combined['CustomerVoicePhone'].str.replace(r'\.0$', '', regex=True)
lists_combined['CustomerTextPhone'] = lists_combined['CustomerTextPhone'].str.replace(r'\.0$', '', regex=True)
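The behaviour described in this thread can be reproduced without Excel. The sketch below assumes a column that mixes strings and floats, which is what xlwings appeared to return here; the sample values are invented. The `.str` accessor yields NaN for cells that are not strings, which would explain the blank rows, and converting to `str` first exposes the trailing `.0` that then has to be stripped.

```python
import pandas as pd

# Invented sample: one formatted string and one number stored as a float.
phones = pd.Series(['(555) 123-4567', 5559876543.0], dtype=object)

# .str methods return NaN for cells that are not strings.
partly_cleaned = phones.str.replace(r'[()+\-\s]', '', regex=True)
print(partly_cleaned.isna().tolist())  # [False, True]

# Convert everything to text, drop a trailing '.0', then strip punctuation
# and keep the last 10 characters.
cleaned = (
    phones.astype(str)
          .str.replace(r'\.0$', '', regex=True)
          .str.replace(r'[()+\-\s]', '', regex=True)
          .str[-10:]
)
print(cleaned.tolist())  # ['5551234567', '5559876543']
```

The dot is escaped in `r'\.0$'` because an unescaped `.` matches any character, so `r'.0$'` would also remove the last two digits of a number that simply ends in 0.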


Harolds · Accepted Answer · 2021-04-12 19:42:16Z

0

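First, you should avoid the chain of replace calls, which hurts the readability of your code. You could instead pass a list to the replace function containing the elements you want to replace with an empty string...

But the main problem with your code is that it should call .str.replace() on the column, e.g. df['col'].str.replace(), and not just df['col'].replace(). On a column, replace() only matches whole cell values by default, whereas .str.replace() works on substrings.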
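To illustrate the difference on invented sample data: `Series.replace` only changes cells whose entire value equals the search string, while `Series.str.replace` operates inside each string. `Series.replace` can also do substring replacement when given `regex=True`, and in that mode it accepts a list of patterns. The variable names below are illustrative.

```python
import pandas as pd

phones = pd.Series(['(555) 123-4567', '-'])

# Whole-value match: only the cell that is exactly '-' changes.
whole_value = phones.replace('-', '')
print(whole_value.tolist())
# ['(555) 123-4567', '']

# Substring match through the .str accessor (literal, not a regex).
substring = phones.str.replace('-', '', regex=False)
print(substring.tolist())
# ['(555) 1234567', '']

# Series.replace with regex=True and a list of escaped patterns.
patterns = [r'\(', r'\)', r'\+', r'-', r'\s']
with_list = phones.replace(patterns, '', regex=True)
print(with_list.tolist())
# ['5551234567', '']
```

This is why the chained `.replace('-','')` calls in the question ran without error but left the phone numbers unchanged.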

Cheers

answered Apr 12, 2021 at 19:42

Harolds
