replacing a column in DataFrame using regex

Question

I have a data frame with 4 columns, col4 is a string including texts and digits:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1    PD380_003 %LINK-3-UPDOWN
Syslog        2016,09,17           1    NM380_005 %BGP-5-NBR_RESET
Syslog        2016,09,14           1    NM380_005 %BGP-5-NBR_RESET
Syslog        2016,09,08           1    DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config

i need to keep a substring of that column and delete anything else, so I used regex and i made a pattern but when i run the following query the result is not what i want, it replace everything with the pattern itself:

data.replace({'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}},regex=True)

desired result is:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1           PD380_003
Syslog        2016,09,17           1           NM380_005
Syslog        2016,09,14           1           LO380_004
Syslog        2016,09,08           1           LO380_004

but the result i get is like:

 Col1           Col2              Col3           Col4
Syslog        2016,09,17           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,17           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,14           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})
Syslog        2016,09,08           1    ([A-Z]{2}[0-9]{3}_[0-9]{3})

what am i doing wrong?

can you post your data DF before replacing?

MaxU - stand with Ukraine
– MaxU - stand with Ukraine

2016-10-10 19:31:40 +00:00
Commented Oct 10, 2016 at 19:31 — MaxU - stand with Ukraine
– MaxU - stand with Ukraine, Commented Oct 10, 2016 at 19:31
yes please check it out again.

faranak777
– faranak777

2016-10-10 19:34:52 +00:00
Commented Oct 10, 2016 at 19:34 — faranak777
– faranak777, Commented Oct 10, 2016 at 19:34

dawg · Accepted Answer · 2016-10-10 21:17:18Z

First, you have the wrong regex's in the wrong positions. The to_replace argument to .replace needs to match what to replace and what to delete. So you need a ^.* in front of and a .*$ behind your regex in this case since you want to trim the string outside the match:

^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$

Demo

Second, the replace argument, if a regex, needs to be a capturing group or fixed string. In this case \1 will do.

Last, the Series form of .replace has a littler simpler syntax (at least for me) to understand.

So given:

>>> df
     Col1        Col2  Col3                                            Col4
0  SysLog  2016,09,17     1                        PD380_003 %LINK-3-UPDOWN
1  SysLog  2016,09,17     1                      NM380_005 %BGP-5-NBR_RESET
2  SysLog  2016,09,17     1                      NM380_005 %BGP-5-NBR_RESET
3  SysLog  2016,09,17     1  DO NOT TICKET LO380_004 %SYS-5-CONFIG_I Config

You can do:

>>> df['Col4'].replace(to_replace='^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', value=r'\1', regex=True) 
0    PD380_003
1    NM380_005
2    NM380_005
3    LO380_004
Name: Col4, dtype: object

You can also use a positional argument version if easier:

df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)

but you need to have regex=True since the replacement string is to be interpreted as a regex -- not just a static string.

Finally, assign directly into the original:

>>> df['Col4']=df['Col4'].replace('^.*([A-Z]{2}[0-9]{3}_[0-9]{3}).*$', r'\1', regex=True)
>>> df
     Col1        Col2  Col3       Col4
0  SysLog  2016,09,17     1  PD380_003
1  SysLog  2016,09,17     1  NM380_005
2  SysLog  2016,09,17     1  NM380_005
3  SysLog  2016,09,17     1  LO380_004

thanks for explaining i'm new to Panda this kid of explanation helps :)

jezrael · Accepted Answer · 2016-10-10 19:48:18Z

3

I think you need extract:

data.Col4 = data.Col4.str.extract('([A-Z]{2}[0-9]{3}_[0-9]{3})', expand=False)

print (data)
     Col1        Col2  Col3       Col4
0  Syslog  2016,09,17     1  PD380_003
1  Syslog  2016,09,17     1  NM380_005
2  Syslog  2016,09,14     1  NM380_005
3  Syslog  2016,09,08     1  LO380_004

edited Oct 10, 2016 at 19:48

answered Oct 10, 2016 at 19:38

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

3 Comments

faranak777 Over a year ago

yes it works but i'm just wondering why replace can't read the regex?

jezrael Over a year ago

Hmm, I check docs and it can works.

jezrael Over a year ago

But I think it is bug.

MaxU - stand with Ukraine · Accepted Answer · 2016-10-10 20:24:04Z

1

You were using RegEx in a wrong way.

{'Col4':{'.*':'([A-Z]{2}[0-9]{3}_[0-9]{3})'}} - means replace a whatever string in the Col4 column with '([A-Z]{2}[0-9]{3}_[0-9]{3})'

Try this:

In [87]: df.replace({'Col4':{r'.*?([A-Z]{2}[0-9]{3}_[0-9]{3}).*':r'\1'}}, regex=True)
Out[87]:
     Col1        Col2  Col3       Col4
0  Syslog  2016,09,17     1  PD380_003
1  Syslog  2016,09,17     1  NM380_005
2  Syslog  2016,09,14     1  NM380_005
3  Syslog  2016,09,08     1  LO380_004

answered Oct 10, 2016 at 20:24

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

Collectives™ on Stack Overflow

replacing a column in DataFrame using regex

3 Answers 3

1 Comment

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related