Python RegEx nested search and replace

Question

I need to do a RegEx search and replace on all commas found inside quote blocks.
For example,

"thing1,blah","thing2,blah","thing3,blah",thing4

needs to become

"thing1\,blah","thing2\,blah","thing3\,blah",thing4

my code:

inFile  = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()

p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q  = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))

I need to filter only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line back.

How can I do this in Python?

Not really an answer, but before you reimplement one, maybe you would be better served by a CSV parser? That seems to be the format you are dealing with.
– riffraff, Commented Oct 4, 2011 at 16:45
I'm actually looking to get the data ready for my custom CSV parser: csv.register_dialect( 'escapedExcel' , delimiter = ',' , skipinitialspace = 0 , doublequote = 1 , quoting = csv.QUOTE_ALL , quotechar = '"' , lineterminator = '\r\n' , escapechar = '\\' )
– user78706, Commented Oct 4, 2011 at 16:49
I see. Then I believe you want to use the span and start methods of the match object to get at the text around the match and recompose your line. But I am not sure why a single call to sub after the "selecting" loop would not be OK.
– riffraff, Commented Oct 4, 2011 at 17:10
@Dragos Toader: Why would you want to replace commas inside quotes? csv.reader has no problems with commas inside quotes.
– Steven Rumbalski, Commented Oct 4, 2011 at 17:26
Adding backslashes just means yet another mechanism for your parser to cope with. Now you will need to backslash all backslashes, too. The proper fix is to teach your CSV parser to ignore commas inside double quotes, or use an existing CSV parser which does.
– tripleee, Commented Oct 4, 2011 at 20:23
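
The span/start idea from the comments can be sketched roughly as follows. This is only an illustration, not code from the thread: the function name escape_quoted_commas and the two pattern names are made up here, and it assumes quote blocks never contain escaped or doubled quotes.

```python
import re

# Hypothetical names, chosen for this sketch only.
QUOTED = re.compile(r'"[^"]*"')        # one double-quoted block
BARE_COMMA = re.compile(r'(?<!\\),')   # a comma not already preceded by a backslash

def escape_quoted_commas(line):
    """Rebuild the line piece by piece, escaping commas only inside quote blocks."""
    parts = []
    last_end = 0
    for m in QUOTED.finditer(line):
        parts.append(line[last_end:m.start()])             # text before this block, unchanged
        parts.append(BARE_COMMA.sub(r'\\,', m.group(0)))   # block with its commas escaped
        last_end = m.end()
    parts.append(line[last_end:])                          # text after the last block
    return ''.join(parts)

print(escape_quoted_commas('"thing1,blah","thing2,blah","thing3,blah",thing4'))
```

The same result can be had from a single sub call with a function as the replacement; the explicit loop only makes the "recompose the line" step visible.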

Steven Rumbalski · Accepted Answer · 2011-10-04 19:57:36Z

3

The csv module is perfect for parsing data like this, because csv.reader in the default dialect does not split fields on quoted commas. csv.writer reinserts the quotes because the fields contain commas. I used StringIO to give a file-like interface to a string.

import csv
import StringIO

s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = StringIO.StringIO(s)
dest = StringIO.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    wtr.writerow([item.replace('\\,',',').replace(',','\\,') for item in row])
print dest.getvalue()

result:

"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"

edited Oct 4, 2011 at 19:57

answered Oct 4, 2011 at 17:57

Steven Rumbalski


1 Comment

eyquem Over a year ago

+1 But you need to write item.replace('\\,',',').replace(',','\\,') , otherwise "thing3\,blah " is replaced with "thing3\\,blah "
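
For readers on Python 3, the csv-based answer above might be ported along these lines. This is a sketch under assumptions: the function name escape_commas_in_fields and the sample string are invented for illustration, and io.StringIO stands in for the Python 2 StringIO module.

```python
import csv
import io

def escape_commas_in_fields(text):
    """Parse text as CSV, escape commas inside each field, and write the rows back."""
    source = io.StringIO(text, newline='')
    dest = io.StringIO(newline='')
    writer = csv.writer(dest)   # default dialect: minimal quoting, '\r\n' line endings
    for row in csv.reader(source):
        # Undo any existing '\,' first so an escaped comma is not escaped twice.
        writer.writerow([item.replace('\\,', ',').replace(',', '\\,')
                         for item in row])
    return dest.getvalue()

sample = '"thing1,blah","thing2,blah",thing4\n'
print(escape_commas_in_fields(sample))
```

Note that the writer only quotes fields that need it, so an unquoted field such as thing4 stays unquoted.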

eyquem · Accepted Answer · 2011-10-04 17:50:36Z

1

General Edit

There was

"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4

in the question, and now it is not there anymore.

Moreover, I hadn't noticed r'[^\\],'.

So I have completely rewritten my answer.

"thing1,blah","thing2,blah","thing3,blah",thing4

and

"thing1\,blah","thing2\,blah","thing3\,blah",thing4

are (I suppose) the strings as displayed by print, not their repr.

import re


ss = '"thing1,blah","thing2,blah","thing3\,blah",thing4 '

regx = re.compile('"[^"]*"')

def repl(mat, ri = re.compile('(?<!\\\\),') ):
    # replace each comma not already preceded by a backslash with \,
    return ri.sub('\\\\,',mat.group())

print ss
print repr(ss)
print
print      regx.sub(repl, ss)
print repr(regx.sub(repl, ss))

result

"thing1,blah","thing2,blah","thing3\,blah",thing4 
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

"thing1\,blah","thing2\,blah","thing3\,blah",thing4 
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '

edited Oct 4, 2011 at 17:50

answered Oct 4, 2011 at 17:14

eyquem


1 Comment

eyquem Over a year ago

This answer has been upvoted. I would like to know why. I'm also perplexed by the fact that my rep then went down by 1 point and not 2!

Narendra Yadala · Accepted Answer · 2011-10-04 18:28:30Z

0

You can try this regex.


>>> re.sub('(?<!"),(?!")', r"\\,", 
                     '"thing1,blah","thing2,blah","thing3,blah",thing4')
#Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4

The logic behind this is to substitute a , with \, only when it is neither immediately preceded nor immediately followed by a "
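
A quick check of the pattern as written above, using made-up input lines (assumptions, not taken from the question). The lookarounds only inspect the characters adjacent to each comma, so a separator between two unquoted fields would be escaped as well:

```python
import re

# Same pattern as in the answer; the variable names are only for this sketch.
pattern = re.compile(r'(?<!"),(?!")')

# Every separator here touches a quote, so only the inner commas change.
quoted_only = pattern.sub(r'\\,', '"thing1,blah","thing2,blah"')
print(quoted_only)

# The comma between the unquoted fields x and y is escaped too.
mixed = pattern.sub(r'\\,', 'x,y,"a,b"')
print(mixed)
```

So the pattern seems safe only when every separator comma sits next to a double quote.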

edited Oct 4, 2011 at 18:28

answered Oct 4, 2011 at 17:17

Narendra Yadala


8 Comments

eyquem Over a year ago

Your solution is better than mine. You just need to write the pattern '([^"]+) *, *([^"]+)' or even '([^"]+)[\t ]*,[\t ]*([^"]+)' in case a comma is between blanks

Narendra Yadala Over a year ago

Added the checks you mentioned. Thanks!

Steven Rumbalski Over a year ago

How does this work with strings with two commas between the quotes? "thing1,blah,moreblah"

Narendra Yadala Over a year ago

@StevenRumbalski Yes, it does not work in that case. Both lookahead and lookbehind have to be used in that scenario. I will see if I can make those changes.

Steven Rumbalski Over a year ago

How about re.sub('".+?"', lambda m: m.group(0).replace(',','\\,'), '"th,ing1,blah","thing2,""blah""","thing3,blah",thing4')?


user78706 · Accepted Answer · 2011-10-04 20:16:15Z

0

I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There's a way to turn all this into a recursive function that calls itself.
Any takers?

outfile  = open(outfileName,'w')

p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in outfileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))

    if pglen > 0:
        mpgstart = 0;
        mpgend   = 0;

        for i,mpg in enumerate(pg):
            if i == 0:
                outfile.write(line[:mpg.start()])

            qg    = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))

            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])

            if qglen > 0:
                for j,mqg in enumerate(qg):
                    if j == 0:
                        outfile.write( mpg.group(0)[:mqg.start()]    )

                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )

                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():]      )
            else:
                outfile.write(mpg.group(0))

            if i == (pglen-1):
                outfile.write(line[mpg.end():])

            mpgstart = mpg.start()
            mpgend   = mpg.end()
    else:
        outfile.write(line)

outfile.close()

answered Oct 4, 2011 at 20:16

user78706

2 Comments

eyquem Over a year ago

Your code is incredibly tortuous: you are using regexes for just a tiny part of what regexes are for, in order to extract pieces of the string that you then process with string methods, which is precisely the job regexes do. By the way, you didn't notice that your code gives a wrong result for "thing3,blah", which is transformed into "thingx\x03,blah"; I don't know how.

eyquem Over a year ago

Also, by the way, did you notice at any moment that there were answers and discussions about the very question you asked? You don't make the slightest allusion to the other answers, which we hoped were helpful. Instead, you post code of no interest, as if you hadn't read the answers. I find this a little unfair.

Code Monkey · Accepted Answer · 2011-10-04 21:18:54Z

0

have you looked into str.replace()?

str.replace(old, new[, count]) Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

here is some documentation

hope this helps
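
A small check, assuming the sample line from the question, suggests that str.replace() on its own is too blunt here: it has no notion of quote blocks, so the separators between fields are escaped along with the commas inside the quotes.

```python
line = '"thing1,blah","thing2,blah","thing3,blah",thing4'

# str.replace() rewrites every comma, inside or outside the quote blocks.
escaped_everywhere = line.replace(',', '\\,')
print(escaped_everywhere)
```

It could still be useful as the inner step, applied only to the text of each quote block once those blocks have been located some other way.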

answered Oct 4, 2011 at 21:18

Code Monkey
