
I'm using Pandas for some data cleanup, and I have a very long regex which I would like to split into multiple lines. The following works fine in Pandas because it is all on one line:

df['REMARKS'] = df['REMARKS'].replace(to_replace=r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b', value=r'<\g<0>>', regex=True)

However, it is difficult to manage. I've tried the following verbose method which works in regular Python:

df['REMARKS'] = df['REMARKS'].replace(to_replace=r"""(?=[^\])}]*([\[({]|$))
                                                     \b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
                                                     \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
                                                     (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b""", value=r'<\g<0>>', regex=True)

This does not work in Pandas, though. Any ideas what I'm missing?

Here is some sample text for testing:

GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG

FIT 13.72 ON 9-7/8 LNR, LWD[GR,RES,APWD,SONVIS], MDTS (PRESS & SAMP) ROT SWC, TSTG BOP

LWD[GR,RES,APWD,SONVIS], GR, RES, NGI, PPC @ 31937, MDTS (PRESS & SAMP) TKG ROT SWC

LWD[GR,RES] @ 12586, IND, FDC, CNL, GR @ 12586, SWC, RAN CSG, PF 12240-12252, RR (ADDED INFO)

Thanks!

1 Answer

One option is to build the pattern as a list of strings and then join them when you call replace:

RegEx = [r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)',
         r'\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?',
         r'(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b']

df['REMARKS'] = df['REMARKS'].replace(to_replace=''.join(RegEx), value=r'<\g<0>>', regex=True)
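As a quick sanity check (my addition, not part of the original answer), you can exercise the joined pattern with plain re.sub on the first sample remark before handing it to Pandas — Pandas uses the same re machinery under the hood when regex=True:

```python
import re

# Same pieces as above, joined into one pattern string.
RegEx = [r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)',
         r'\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?',
         r'(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b']
pattern = ''.join(RegEx)

# First sample remark from the question.
sample = 'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG'
result = re.sub(pattern, r'<\g<0>>', sample)
print(result)  # <GR, MDT, CMR, HLDS, NEXT, NGI> @ 25273, COMPTG
```

The run of keywords is wrapped as one match, while the trailing depth and comment are left alone.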

Using re

import re

s = r"""(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b"""

df['REMARKS'] = df['REMARKS'].replace(to_replace=re.compile(s, re.VERBOSE), value=r'<\g<0>>')
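To convince yourself the two forms are equivalent, you can compare the VERBOSE-compiled pattern against the original one-liner with plain re (a quick check I'm adding here, outside of Pandas). re.VERBOSE ignores unescaped whitespace outside character classes, so the indentation and line breaks drop out:

```python
import re

# The original single-line pattern from the question.
one_line = r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b'

# Same pattern split across lines; re.VERBOSE discards the
# unescaped indentation and newlines.
s = r"""(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b"""
verbose = re.compile(s, re.VERBOSE)

sample = 'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG'
assert verbose.sub(r'<\g<0>>', sample) == re.sub(one_line, r'<\g<0>>', sample)
```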

4 Comments

Thanks for the idea, Chris. It is strange, though, that we'd have to jump through these types of hoops in Pandas.
@Heather it is more of a regex issue. In Python, you can end a line with \ to continue a statement on the next line. However, \ is the escape character in regex, so that is not an option here. Triple quotes """ alone won't work either, because every line break inserts a literal \n into the pattern.
If that's the case, what about the recommendation to use the regex verbose method mentioned here: stackoverflow.com/questions/33211404/…? It specifically mentions that the triple quotes denote verbose mode in regex.
@Heather if you want to use re.VERBOSE, then you need to import the re package and use re.compile on the string. I will update my answer to reflect that as an option. Either way, you still need to assemble the full pattern, whether by join or by re.compile.
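The \n point from the comments is easy to demonstrate (an illustration I'm adding, using the sample text from the question): without re.VERBOSE, the line breaks and indentation become literal characters in the pattern, so it never matches.

```python
import re

# Triple-quoted pattern, exactly as split in the question.
s = r"""(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b"""

sample = 'GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG'

print('\n' in s)                                     # True: literal newlines sit in the pattern
print(re.search(s, sample) is None)                  # True: without VERBOSE it cannot match
print(re.search(s, sample, re.VERBOSE) is not None)  # True: VERBOSE ignores the whitespace
```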
