0

I have a data frame where one column has string values and the other has integers but those columns have special characters with it or the string data has integers with it. So to remove it I used regex my regex is working fine but for the integer column, if 'abc123' is then it is not removing the abc and same with string column if '123abc' is there then it is not removing it. I don't know if the pattern or is wrong or the code is wrong. Below is my code,

d = [['abc','123'],['gbc@*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
print(df1)

   str     int
0   abc    123
1   gbc@*   23abc
2   abc123  abc200
3   124abc  1230&*
4   abcer£$%&*! 230!?*&

num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)

    

Output:

correct_int: ['23', '1230', '230']

As you can see it removed for '23abc','1230&*','230!?*&' but not for 'abc200' as the string was coming first

correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']

now it removed for all but sometimes it's not removing when the value is '124abc'

Is my pattern wrong? I have also tried giving different patterns but nothing worked

I am removing the integers and special characters in the column 'str' and removing string values and special characters in column 'int' Expected output: Once after cleaning and replacing with the old with the cleaned values the output would look like this.

    str     int
0   abc     123
1   gbc     23
2   abc     200
3   abc     1230
4   abcer   230
2
  • 2
    To be clear, you're trying to remove the digits from values in both columns? Could you post your expected output just to make that clearer? Commented Mar 26, 2021 at 19:32
  • @HS-nebula No, digits and special characters from column 'str' and strings(alphabets), special characters from column 'int'. Could you please check it again, I have edited and posted the excepted ouput. Commented Mar 26, 2021 at 19:39

2 Answers 2

2

You can do it with

df1['str'] = df1['str'].str.replace(r"[\d\W+]", '') # replaces numbers (\d) and non-word characters (\W) with empty strings

df1['int'] = df1['int'].str.replace(r"\D+", '') # replaces any non-decimal digit character (like [^0-9])

Returns:

    str     int
0   abc     123
1   gbc     23
2   abc     200
3   abc     1230
4   abcer   230
Sign up to request clarification or add additional context in comments.

Comments

1

Try the following:

  1. '\D' represents any non digit value, substitute those with empty string '' in int column
  2. [^a-zA-Z] represents any character not in the range a-z and A-Z, substitute those with empty string '' in str column
  3. Apply these transformations to both columns using .apply() and a lambda function
import pandas as pd
import re

d = [['abc','123'],['gbc@*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
df1['int'] = df1['int'].apply(lambda r: re.sub('\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub('[^a-zA-Z]', '', r))
print(df1)

Output:

     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.