
I'm using Pandas for some data cleanup, and I have a very long regex which I would like to split into multiple lines. The following works fine in Pandas because it is all on one line:

df['REMARKS'] = df['REMARKS'].replace(to_replace=r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b', value=r'<\g<0>>', regex=True)

However, it is difficult to manage. I've tried the following verbose method which works in regular Python:

df['REMARKS'] = df['REMARKS'].replace(to_replace=r"""(?=[^\])}]*([\[({]|$))
                                                     \b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
                                                     \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
                                                     (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b""", value=r'<\g<0>>', regex=True)

This does not work in Pandas, though. Any ideas what I'm missing?

Here is some sample text for testing:

GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG

FIT 13.72 ON 9-7/8 LNR, LWD[GR,RES,APWD,SONVIS], MDTS (PRESS & SAMP) ROT SWC, TSTG BOP

LWD[GR,RES,APWD,SONVIS], GR, RES, NGI, PPC @ 31937, MDTS (PRESS & SAMP) TKG ROT SWC

LWD[GR,RES] @ 12586, IND, FDC, CNL, GR @ 12586, SWC, RAN CSG, PF 12240-12252, RR (ADDED INFO)

Thanks!

1 Answer

One option is to build the pattern as a list of strings and then join them when you call replace:

RegEx = [r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)',
         r'\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?',
         r'(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b']

df['REMARKS'] = df['REMARKS'].replace(to_replace=''.join(RegEx), value=r'<\g<0>>', regex=True)
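As a quick sanity check (my addition, not part of the original answer), you can exercise the joined pattern with plain re.sub on the first sample remark before handing it to Pandas — Pandas uses the same re machinery under the hood when regex=True:

```python
import re

# Same pieces as above, joined into one pattern string.
RegEx = [r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)',
         r'\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?',
         r'(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b']
pattern = ''.join(RegEx)

# First sample remark from the question.
sample = 'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG'
result = re.sub(pattern, r'<\g<0>>', sample)
print(result)  # <GR, MDT, CMR, HLDS, NEXT, NGI> @ 25273, COMPTG
```

The run of keywords is wrapped as one match, while the trailing depth and comment are left alone.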

Using re

import re

s = r"""(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b"""

df['REMARKS'] = df['REMARKS'].replace(to_replace=re.compile(s, re.VERBOSE), value=r'<\g<0>>')
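To convince yourself the two forms are equivalent, you can compare the VERBOSE-compiled pattern against the original one-liner with plain re (a quick check I'm adding here, outside of Pandas). re.VERBOSE ignores unescaped whitespace outside character classes, so the indentation and line breaks drop out:

```python
import re

# The original single-line pattern from the question.
one_line = r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b'

# Same pattern split across lines; re.VERBOSE discards the
# unescaped indentation and newlines.
s = r"""(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b"""
verbose = re.compile(s, re.VERBOSE)

sample = 'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG'
assert verbose.sub(r'<\g<0>>', sample) == re.sub(one_line, r'<\g<0>>', sample)
```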

4 Comments

Thanks for the idea, Chris. It is strange, though, that we'd have to jump through these types of hoops in Pandas.
@Heather it is more of a regex issue. In Python, you can end a line with \ to continue a statement on the next line. However, \ is the escape character in regex, so that is not an option here. Triple quotes """ alone won't work either, because every line break inserts a literal \n into the pattern.
If that's the case, what about the recommendation to use the regex verbose method mentioned here: stackoverflow.com/questions/33211404/…? It specifically mentions that the triple quotes denote verbose mode in regex.
@Heather if you want to use re.VERBOSE, then you need to import the re package and use re.compile on the string. I will update my answer to reflect that as an option. Either way, you still need to assemble the full pattern, whether by join or by re.compile.
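The \n point from the comments is easy to demonstrate (an illustration I'm adding, using the sample text from the question): without re.VERBOSE, the line breaks and indentation become literal characters in the pattern, so it never matches.

```python
import re

# Triple-quoted pattern, exactly as split in the question.
s = r"""(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b"""

sample = 'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG'

print('\n' in s)                                     # True: literal newlines sit in the pattern
print(re.search(s, sample) is None)                  # True: without VERBOSE it cannot match
print(re.search(s, sample, re.VERBOSE) is not None)  # True: VERBOSE ignores the whitespace
```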
