Regex not working properly for some cases (python)?

Question

I have a DataFrame in which one column holds string values and the other holds integers, but the values are polluted: the string column contains digits and special characters, and the integer column contains letters and special characters. I used regex to clean them. It mostly works, but in the integer column a value like 'abc123' keeps its 'abc', and in the string column a value like '123abc' is sometimes not cleaned. I don't know whether the pattern or the code is wrong. Below is my code:

import re
import pandas as pd

d = [['abc','123'],['gbc@*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1 = pd.DataFrame(d, columns=['str','int'])
print(df1)

    str          int
0   abc          123
1   gbc@*        23abc
2   abc123       abc200
3   124abc       1230&*
4   abcer£$%&*!  230!?*&

num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)

Output:

correct_int: ['23', '1230', '230']

As you can see, it cleaned '23abc', '1230&*', and '230!?*&', but not 'abc200', where the letters come first.

correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']

Here it cleaned every value, but sometimes it does not clean a value like '124abc'.

Is my pattern wrong? I have also tried different patterns, but nothing worked.

I want to remove digits and special characters from column 'str', and letters and special characters from column 'int'. Expected output, after replacing the old values with the cleaned ones:

    str     int
0   abc     123
1   gbc     23
2   abc     200
3   abc     1230
4   abcer   230

To be clear, you're trying to remove the digits from values in both columns? Could you post your expected output just to make that clearer?
– m13op22, Commented Mar 26, 2021 at 19:32
@HS-nebula No, digits and special characters from column 'str', and letters and special characters from column 'int'. Could you please check it again? I have edited the question and posted the expected output.
– Toi, Commented Mar 26, 2021 at 19:39

m13op22 · Accepted Answer · 2021-03-26 19:48:38Z

2

You can do it with str.replace. Pass regex=True explicitly, because pandas 2.0 and later treat the pattern as a literal string by default:

df1['str'] = df1['str'].str.replace(r"[\d\W]", '', regex=True) # removes digits (\d) and non-word characters (\W)

df1['int'] = df1['int'].str.replace(r"\D+", '', regex=True) # removes runs of non-digit characters (like [^0-9]+)

Returns:

    str     int
0   abc     123
1   gbc     23
2   abc     200
3   abc     1230
4   abcer   230
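
As for why the original filter skipped 'abc200': str.contains searches anywhere in the string, so r'\d+$' matches any value that merely ends in digits, and that row never reaches the cleanup step. A minimal sketch reusing the question's data, comparing it with str.fullmatch (available in pandas 1.1 and later, as far as I know), which requires the whole value to match. The names ends_in_digits and all_digits are only illustrative:

```python
import pandas as pd

d = [['abc', '123'], ['gbc@*', '23abc'], ['abc123', 'abc200'],
     ['124abc', '1230&*'], ['abcer£$%&*!', '230!?*&']]
df1 = pd.DataFrame(d, columns=['str', 'int'])

# contains() searches anywhere in the value, so 'abc200' matches r'\d+$'
ends_in_digits = df1['int'].str.contains(r'\d+$')

# fullmatch() requires the entire value to consist of digits
all_digits = df1['int'].str.fullmatch(r'\d+')

# Values the original filter treated as clean vs. values that really are clean
treated_as_clean = df1.loc[ends_in_digits, 'int'].tolist()
really_clean = df1.loc[all_digits, 'int'].tolist()
print(treated_as_clean)  # ['123', 'abc200']
print(really_clean)      # ['123']
```

In practice the filter is not needed at all, since running the replacement on every row leaves already clean values unchanged.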


Abhi_J · Answer · 2021-03-26 19:46 (edited by marc_s, 2021-04-04 16:04:09Z)

1

Try the following:

\D matches any non-digit character; substitute those with the empty string '' in the int column.
[^a-zA-Z] matches any character outside the ranges a-z and A-Z; substitute those with the empty string '' in the str column.
Apply these substitutions to both columns using .apply() and a lambda function.

import pandas as pd
import re

d = [['abc','123'],['gbc@*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
# Raw strings (r'...') keep the backslash in \D from being read as a string escape
df1['int'] = df1['int'].apply(lambda r: re.sub(r'\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub(r'[^a-zA-Z]', '', r))
print(df1)

Output:

     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230
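
The two character classes suggested on this page, [\d\W] and [^a-zA-Z], give the same result on the question's data but are not equivalent in general. A small sketch of the difference, using made-up sample strings (not from the question): in Python 3, \W treats underscores and non-ASCII letters as word characters, so they survive the first pattern but not the second.

```python
import re

# Hypothetical sample values chosen to show where the patterns diverge
samples = ['abc_123', 'café42', 'abcer£$%&*!']

results = []
for s in samples:
    word_based = re.sub(r'[\d\W]', '', s)     # keeps '_' and non-ASCII letters
    ascii_only = re.sub(r'[^a-zA-Z]', '', s)  # keeps only a-z and A-Z
    results.append((s, word_based, ascii_only))
    print(s, word_based, ascii_only)
```

If the column should end up with plain ASCII letters only, [^a-zA-Z] is the stricter choice.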

answered Mar 26, 2021 at 19:46 by Abhi_J, edited Apr 4, 2021 at 16:04 by marc_s
