3

I have a data frame with 4 columns, col4 is a string including texts and digits:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1    PD380_003 %LINK-3-UPDOWN
Syslog        2016,09,17           1    NM380_005 %BGP-5-NBR_RESET
Syslog        2016,09,14           1    NM380_005 %BGP-5-NBR_RESET
Syslog        2016,09,08           1    DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config

i need to keep a substring of that column and delete anything else, so I used regex and i made a pattern but when i run the following query the result is not what i want, it replace everything with the pattern itself:

data.replace({'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}},regex=True)

desired result is:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1           PD380_003
Syslog        2016,09,17           1           NM380_005
Syslog        2016,09,14           1           LO380_004
Syslog        2016,09,08           1           LO380_004

but the result i get is like:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,17           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,14           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,08           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})

what am i doing wrong?

2
  • can you post your data DF before replacing? Commented Oct 10, 2016 at 19:31
  • yes please check it out again. Commented Oct 10, 2016 at 19:34

3 Answers 3

3

First, you have the wrong regex's in the wrong positions. The to_replace argument to .replace needs to match what to replace and what to delete. So you need a ^.* in front of and a .*$ behind your regex in this case since you want to trim the string outside the match:

^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$

Demo

Second, the replace argument, if a regex, needs to be a capturing group or fixed string. In this case \1 will do.

Last, the Series form of .replace has a littler simpler syntax (at least for me) to understand.

So given:

>>> df
     Col1        Col2  Col3                                            Col4
0  SysLog  2016,09,17     1                        PD380_003 %LINK-3-UPDOWN
1  SysLog  2016,09,17     1                      NM380_005 %BGP-5-NBR_RESET
2  SysLog  2016,09,17     1                      NM380_005 %BGP-5-NBR_RESET
3  SysLog  2016,09,17     1  DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config

You can do:

>>> df['Col4'].replace(to_replace='^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', value=r'\1', regex=True) 
0    PD380_003
1    NM380_005
2    NM380_005
3    LO380_004
Name: Col4, dtype: object

You can also use a positional argument version if easier:

df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)

but you need to have regex=True since the replacement string is to be interpreted as a regex -- not just a static string.

Finally, assign directly into the original:

>>> df['Col4']=df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)
>>> df
     Col1        Col2  Col3       Col4
0  SysLog  2016,09,17     1  PD380_003
1  SysLog  2016,09,17     1  NM380_005
2  SysLog  2016,09,17     1  NM380_005
3  SysLog  2016,09,17     1  LO380_004
Sign up to request clarification or add additional context in comments.

1 Comment

thanks for explaining i'm new to Panda this kid of explanation helps :)
3

I think you need extract:

data.Col4 = data.Col4.str.extract('([A-Z]{2}[0-9]{3}_[0-9]{3})', expand=False)

print (data)
     Col1        Col2  Col3       Col4
0  Syslog  2016,09,17     1  PD380_003
1  Syslog  2016,09,17     1  NM380_005
2  Syslog  2016,09,14     1  NM380_005
3  Syslog  2016,09,08     1  LO380_004

3 Comments

yes it works but i'm just wondering why replace can't read the regex?
Hmm, I check docs and it can works.
But I think it is bug.
1

You were using RegEx in a wrong way.

{'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}} - means replace a whatever string in the Col4 column with '([A-Z]{2}[0-9]{3}_[0-9]{3})'

Try this:

In [87]: df.replace({'Col4':{r'.*?([A-Z]{2}[0-9]{3}_[0-9]{3}).*':r'\1'}}, regex=True)
Out[87]:
     Col1        Col2  Col3       Col4
0  Syslog  2016,09,17     1  PD380_003
1  Syslog  2016,09,17     1  NM380_005
2  Syslog  2016,09,14     1  NM380_005
3  Syslog  2016,09,08     1  LO380_004

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.