0

I have successfully extracted the data from two sheets and appended them, but I want to clean the phone number field. The replace line does not raise an error, but it also does not change anything.

Is there another method I should use to clean the phone number string?

import pandas as pd
import xlwings as xw

filename = 'file.xlsx'
wb = xw.Book(filename)
sheet1 = wb.sheets['sheet1']
df1 = sheet1.used_range.options(pd.DataFrame, index=False, header=True).value
sheet2 = wb.sheets['sheet2']
df2 = sheet2.used_range.options(pd.DataFrame, index=False, header=True).value
wb.close()
lists_combined = pd.concat([df1, df2])
lists_combined['filename'] = filename

lists_combined['CustomerVoicePhone'] = lists_combined['CustomerVoicePhone'].replace('-','').replace('(','').replace(')','').replace('+','').replace(' ','')

lists_combined = lists_combined.filter(items=['filename','CustomerEmail', 'CustomerVoicePhone','CustomerTextPhone'])
2
  • I'm not too sure about the pandas replace function and whether it can be used like that, but as an alternative, you could try using a regex with the pandas apply() function. Commented Apr 12, 2021 at 19:31
  • Try .replace(['-', '(', ')', '+', ' '], ''). Commented Apr 12, 2021 at 19:43

3 Answers

1

You can apply to all the rows a filtering lambda function which takes every character and only keeps digits:

lists_combined['CustomerVoicePhone'] = (lists_combined.CustomerVoicePhone
                                                      .map(lambda x: ''.join(filter(str.isdigit, x))))
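
To make the lambda above easier to follow, here is a minimal sketch of the same filtering step applied to single strings. The helper name `digits_only` and the sample numbers are made up for illustration and are not part of the original answer.

```python
def digits_only(text):
    """Keep only the characters of `text` for which str.isdigit() is true."""
    return ''.join(filter(str.isdigit, text))


print(digits_only('+1 (555) 123-4567'))  # 15551234567
print(digits_only('5551234567'))         # 5551234567 (already clean, unchanged)
```

Note that this only works on strings: a cell holding a float or None would raise a TypeError, because filter() needs something iterable.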

In terms of performance, we can compare it with the other answer in the following code, and see that it's a bit faster for a large dataframe (100k phone numbers):

import random
import pandas as pd

def gen_phone():
    first = str(random.randint(100,999))
    second = str(random.randint(1,888)).zfill(3)
    last = (str(random.randint(1,9998)).zfill(4))
    while last in ['1111','2222','3333','4444','5555','6666','7777','8888']:
        last = (str(random.randint(1,9998)).zfill(4))
    return '{}-{}-{}'.format(first,second, last)

# Build the column in one go; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame({'p': [gen_phone() for _ in range(100000)]})

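def method1():
    regex = r'\)|\(|-|\+|\s' # or regex = r'[\(\)\+\-\s]' using a character class
    # regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
    df['p_1'] = (df['p'].str.replace(regex, '', regex=True)
                                 .fillna(df['p']))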

%time method1()
# Wall time: 166 ms

def method2():
    df['p_2'] = (df.p.map(lambda x: ''.join(filter(str.isdigit, x))))

%time method2()
# Wall time: 151 ms

3 Comments

Code only answers are not suitable for SO, you need to explain what you are doing here
@BenDegroot I have explained and benchmarked this answer, does it solve your problem?
I didn't ask the original question; I came here to moderate, because someone flagged your answer as very low quality. Now it is much better! +1 from me
0

Let's use the .str accessor with replace and a regex:

regex = r'\)|\(|-|\+|\s' # or regex = r'[\(\)\+\-\s]' using a character class
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
lists_combined['CustomerVoicePhone'] = (lists_combined['CustomerVoicePhone'].str.replace(regex, '', regex=True)
                                 .fillna(lists_combined['CustomerVoicePhone']))

8 Comments

Thanks for the help! That worked for all the lines that had characters but returned blank rows for the phone numbers that were already clean. Any ideas on how to modify?
Add .fillna(lists_combined['CustomerVoicePhone']) at the end.
@BenDegroot did this help?
Yes it did, except for one more small issue. The lines that were already clean are not giving me the right 10 characters. I tried adding another .str[-10:] after the .fillna but that nulled it. lists_combined['CustomerVoicePhone'] = lists_combined['CustomerVoicePhone'].str.replace(regex,'').str.replace(' ','').str[-10:].fillna(lists_combined['CustomerVoicePhone'])
I figured out the issue. I stepped through the format that xlwings was grabbing from the Excel file and found it was classifying some values as floats, adding a trailing .0. I converted the df to strings and then removed the '.0'. That did the trick! Thanks so much! lists_combined = lists_combined.astype(str) lists_combined['CustomerVoicePhone'] = lists_combined['CustomerVoicePhone'].str.replace(r'\.0$', '', regex=True) lists_combined['CustomerTextPhone'] = lists_combined['CustomerTextPhone'].str.replace(r'\.0$', '', regex=True)
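
The float issue described in the last comment can also be handled per value before any string cleaning. The following is a minimal sketch, assuming cells may arrive as strings, floats, ints, or empty values; the helper name `clean_phone` and the sample values are hypothetical and not from the original thread.

```python
import math
import re


def clean_phone(value):
    """Return only the digits of a phone value read from a spreadsheet cell."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None          # empty cell
        if value.is_integer():
            value = int(value)   # 5551234567.0 -> 5551234567, so no '.0' suffix
    return re.sub(r'\D', '', str(value))  # drop every non-digit character


samples = ['(555) 123-4567', '+1 555-123-4567', 5551234567.0, 5551234567, None]
cleaned = [clean_phone(v) for v in samples]
print(cleaned)
```

On a DataFrame, the same function could be applied with something like lists_combined['CustomerVoicePhone'].map(clean_phone), which avoids converting the whole frame to strings first.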
0

First, you should avoid the chain of replace calls, which hurts the readability of your code. You could pass a list to the replace function containing the elements you want to replace with an empty string.

But the main problem with your code is that it should use .str.replace() on the column and not just .replace(): Series.replace matches whole cell values by default, while Series.str.replace substitutes inside each string.
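
The difference between the two methods can be seen in a small sketch like the one below. The sample Series is invented for illustration, and the character-class pattern is one possible choice rather than the only one.

```python
import pandas as pd

phones = pd.Series(['(555) 123-4567', '555-123-4567', '5551234567', '-'])

# Series.replace compares whole cell values by default, so only the cell
# that is exactly '-' changes; the dashes inside other cells are untouched.
whole_value = phones.replace('-', '')

# Series.str.replace substitutes inside each string.
substrings = phones.str.replace(r'[()+\-\s]', '', regex=True)

# Series.replace also does substring substitution when regex=True is passed.
via_regex_flag = phones.replace(r'[()+\-\s]', '', regex=True)

print(whole_value.tolist())
print(substrings.tolist())
```

This is why the chained .replace('-','') calls in the question ran without error but left the phone numbers unchanged.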

Cheers
