
I'm cleaning up some UK telephone numbers and have decided to apply a lambda function across a pandas DataFrame column, using a regex substitution to remove the characters I don't want (anything non-numeric, but allowing a +).

The code is as follows (phone_test is just a DataFrame of test examples with two columns: an index and the values):

def clean_phone_number(tel_no):
    for row in test_data:
        row = re.sub('[^?0-9+]+', '', row)
        return(row)

phone_test_result = phone_test['TEL_NUMBER'].apply(lambda x: clean_phone_number(x))

The problem is that the outcome (phone_test_result) just returns the index of the phone_test DataFrame and not the newly formatted telephone number. I've been racking my brain for a couple of hours, but I'm sure it's a simple problem.

At first I thought it was just the positioning of the return line (it should be under the for, right?), but when I do that I just get a single phone number, repeated for the length of the loop (one that isn't even in the phone_test DataFrame!).

PLS HALP SO. thank you.


After the responses, this is what I've ended up with:

Clean the phone number using regex and only take the first 13 characters,
- substituting a leading zero with +44
- deleting everything with a length of less than 13 characters.
It's not perfect:
- there are some legitimate phone numbers with fewer digits
- it means I trim out all of the extension numbers

def clean_phone_number(tel_no):
    clean_tel = re.sub('[^?0-9+]+', '', tel_no)[:13]
    if clean_tel[:1] == '0':
        clean_tel = '+44'+clean_tel[1:]
        if len(clean_tel) < 13:
            clean_tel = ''
    return(clean_tel)
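
One thing to watch in the function above: the slice to 13 characters happens before the leading 0 is swapped for +44, so a 0-prefixed number with an extension can come out up to 15 characters long, and the length check only runs for numbers that started with 0. Below is a minimal sketch of a variant that reorders those steps. The name clean_phone_number_strict and the sample numbers are made up for illustration, and it assumes every number really should end up exactly 13 characters long.

```python
import re

def clean_phone_number_strict(tel_no):
    # Same pattern as above. Note that '?' inside [...] is a literal,
    # so question marks are kept along with digits and '+'.
    clean_tel = re.sub('[^?0-9+]+', '', tel_no)
    # Swap a leading 0 for +44 first, so the cut applies to the final form.
    if clean_tel[:1] == '0':
        clean_tel = '+44' + clean_tel[1:]
    clean_tel = clean_tel[:13]
    # Apply the length check to every number, not only the 0-prefixed ones.
    if len(clean_tel) < 13:
        clean_tel = ''
    return clean_tel

# Hypothetical sample inputs, not taken from the real data.
samples = ['07721 051 851', '07721 051 851 ext 22', '+44 7721 051 851', '020 8413 96']
results = [clean_phone_number_strict(s) for s in samples]
print(results)
```

The two limitations listed above still apply: short but legitimate numbers are blanked, and extensions are dropped.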

2 Answers

pd.Series.apply applies a function to each value in a Series. Note that the lambda is unnecessary:

import re
import pandas as pd

phone_test = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

def clean_phone_number(tel_no):
    return re.sub('[^?0-9+]+', '', tel_no)

phone_test_result = phone_test['TEL_NUMBER'].apply(clean_phone_number)

# 0      +44020841396
# 1    +4407721051851
# Name: TEL_NUMBER, dtype: object

pd.DataFrame.apply, in contrast, applies a function to each row in a dataframe:

def clean_phone_number(row):
    return re.sub('[^?0-9+]+', '', row['TEL_NUMBER'])

phone_test_result = phone_test.apply(clean_phone_number, axis=1)

# 0      +44020841396
# 1    +4407721051851
# dtype: object
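
For a plain substitution like this one, pandas also offers a vectorized string method, so apply may not be needed at all. This is a sketch of that alternative rather than the approach shown above; it reuses the same sample frame and pattern, and assumes a pandas version where Series.str.replace accepts the regex keyword.

```python
import pandas as pd

phone_test = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

# regex=True is passed explicitly so the pattern is treated as a regular
# expression rather than a literal string.
phone_test_result = phone_test['TEL_NUMBER'].str.replace('[^?0-9+]+', '', regex=True)

print(phone_test_result.tolist())
```

Unlike re.sub inside apply, the .str methods pass missing values through as NaN instead of raising, which can matter on messy real-world data.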

You don't have to loop; the function will be executed for each element:

def clean_phone_number(tel_no):
    return re.sub('[^?0-9+]+', '', tel_no)

Or directly:

phone_test_result = phone_test['TEL_NUMBER'].apply(lambda x: re.sub('[^?0-9+]+', '', x))
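
To make "executed for each element" concrete, here is a small sketch comparing apply with a hand-written loop over the same values. The sample numbers are borrowed from the other answer, and the names via_apply and via_loop are only for illustration.

```python
import re
import pandas as pd

phone_test = pd.DataFrame({'TEL_NUMBER': ['+44-020841396', '+44-07721-051-851']})

def clean_phone_number(tel_no):
    # Receives ONE value (a string) per call, not the whole column.
    return re.sub('[^?0-9+]+', '', tel_no)

# apply calls clean_phone_number once per value in the column...
via_apply = phone_test['TEL_NUMBER'].apply(clean_phone_number)

# ...which gives the same values as looping over the column by hand.
via_loop = [clean_phone_number(value) for value in phone_test['TEL_NUMBER']]

print(list(via_apply) == via_loop)
```

So any loop placed inside the function runs again for every single value, which is why the function passed to apply should handle just one value.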

3 Comments

You don't have to loop. To clarify, apply is just a thinly veiled loop.
Yes, I meant it in the sense of "you don't need to write your own for loop", but it's true that it's better to have your clarification here in case the OP is unaware of that :)
Thank you very much for this clarification. I've been writing Python full time for a month now and have been including loops in functions that I intend to use with apply... continually!!
