2

I need to do a regex search and replace on all commas found inside quoted blocks,
i.e.

"thing1,blah","thing2,blah","thing3,blah",thing4  

needs to become

"thing1\,blah","thing2\,blah","thing3\,blah",thing4  

my code:

inFile  = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()

p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q  = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))

I need to filter only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line back.

How can I do this in Python?

6
  • not really an answer but before you reimplement one, maybe you could be better served by a CSV parser? That seems the format you are dealing with. Commented Oct 4, 2011 at 16:45
  • I'm actually looking to get the data ready for my custom CSV parser csv.register_dialect( 'escapedExcel' , delimiter = ',' , skipinitialspace = 0 , doublequote = 1 , quoting = csv.QUOTE_ALL , quotechar = '"' , lineterminator = '\r\n' , escapechar = '\\' ) Commented Oct 4, 2011 at 16:49
  • I see, then I believe you want to use the methods span and start of the match object to get at the stuff that was around it and recompose your line. But I am not sure why a single call to sub after the "selecting" loop would not be ok. Commented Oct 4, 2011 at 17:10
  • 3
    @Dragos Toader: Why would you want to replace commas inside quotes? csv.reader has no problems with commas inside quotes. Commented Oct 4, 2011 at 17:26
  • Adding backslashes just means yet another mechanism for your parser to cope with. Now you will need to backslash all backslashes, too. The proper fix is to teach your CSV parser to ignore commas inside double quotes, or use an existing CSV parser which does. Commented Oct 4, 2011 at 20:23

5 Answers

3

The csv module is well suited to parsing data like this, because csv.reader in the default dialect treats commas inside quotes as part of the field rather than as delimiters. csv.writer puts the quotes back because the fields contain commas. I used StringIO to give a file-like interface to a string.

import csv
import StringIO

s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = StringIO.StringIO(s)
dest = StringIO.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    wtr.writerow([item.replace('\\,',',').replace(',','\\,') for item in row])
print dest.getvalue()

result:

"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"
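For readers on Python 3, the same idea can be sketched as follows. This is an illustrative port, not the answerer's code: the helper name `escape_commas_in_fields` is made up for the example, `io.StringIO` stands in for the Python 2 `StringIO` module, and the default dialect is assumed (minimal quoting, `\r\n` line terminator, no escape character).

```python
import csv
import io


def escape_commas_in_fields(text):
    """Backslash-escape the commas inside each CSV field of `text`.

    Hypothetical helper for illustration; uses the default csv dialect.
    """
    source = io.StringIO(text, newline="")
    dest = io.StringIO(newline="")
    writer = csv.writer(dest)
    for row in csv.reader(source):
        # Undo existing escapes first so that "\," does not become "\\,".
        writer.writerow(
            [item.replace("\\,", ",").replace(",", "\\,") for item in row]
        )
    return dest.getvalue()


sample = '"thing1,blah","thing2,blah","thing3\\,blah",thing4\n'
print(escape_commas_in_fields(sample), end="")
```

Fields that contain a comma are re-quoted by the writer, while a field such as thing4 is written without quotes, which matches the layout asked for in the question.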

1 Comment

+1 But you need to write item.replace('\\,',',').replace(',','\\,') , otherwise "thing3\,blah " is replaced with "thing3\\,blah "
1

General Edit

There was

"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4   

in the question, and now it is not there anymore.

Moreover, I hadn't noticed r'[^\\],'.

So I have completely rewritten my answer.

"thing1,blah","thing2,blah","thing3,blah",thing4               

and

"thing1\,blah","thing2\,blah","thing3\,blah",thing4

are, I suppose, the strings as they are displayed when printed.

import re


ss = '"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

# matches each double-quoted block
regx = re.compile('"[^"]*"')

# escapes every comma in the block that is not already preceded by a backslash
def repl(mat, ri = re.compile('(?<!\\\\),') ):
    return ri.sub('\\\\,',mat.group())

print ss
print repr(ss)
print
print      regx.sub(repl, ss)
print repr(regx.sub(repl, ss))

result

"thing1,blah","thing2,blah","thing3\,blah",thing4 
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

"thing1\,blah","thing2\,blah","thing3\,blah",thing4 
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '

1 Comment

This answer has been upvoted. I would like to know why. I'm also perplexed that my rep then went down by 1 point and not 2!
0

You can try this regex.


>>> re.sub('(?<!"),(?!")', r"\\,", 
                     '"thing1,blah","thing2,blah","thing3,blah",thing4')
#Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4

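The logic behind this is to substitute a , with \, only if it is neither immediately preceded nor immediately followed by a ". As a consequence, a comma that separates two unquoted fields would also be escaped.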
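A small sketch of how the lookaround pattern behaves, written for Python 3. The helper name `escape` and the sample strings are made up for illustration. The second sample shows the limitation: the pattern only looks at the characters next to the comma, not at whether the comma actually sits inside quotes.

```python
import re

# Pattern from the answer: a comma with no double quote directly on either side.
pattern = re.compile(r'(?<!"),(?!")')


def escape(line):
    # In the replacement template, r"\\," means a literal backslash plus a comma.
    return pattern.sub(r"\\,", line)


# Commas inside quoted fields are escaped, even when a field holds several.
quoted = escape('"thing1,blah","thing2,blah,more",thing4')
print(quoted)

# Limitation: the comma between two unquoted fields is escaped as well.
unquoted = escape('thing4,thing5,"a,b"')
print(unquoted)
```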

8 Comments

Your solution is better than mine. You just need to write the pattern '([^"]+) *, *([^"]+)' or even '([^"]+)[\t ]*,[\t ]*([^"]+)' in case a comma is between blanks
Added the checks you mentioned. Thanks!
How does this work with strings with two commas between the quotes? "thing1,blah,moreblah"
@StevenRumbalski Yes, it does not work in that case. Both lookahead and lookbehind have to be used in that scenario. I will see if I can make those changes.
How about re.sub('".+?"', lambda m: m.group(0).replace(',','\\,'), '"th,ing1,blah","thing2,""blah""","thing3,blah",thing4')?
0

I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There's a way to turn all this into a recursive function that calls itself.
Any takers?

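import re

# Assumption: inFileRl holds the input lines, read as in the question,
# and outfileName is the path of the file to write.
outfile  = open(outfileName,'w')

p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in inFileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))

    if pglen > 0:
        mpgstart = 0
        mpgend   = 0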

        for i,mpg in enumerate(pg):
            if i == 0:
                outfile.write(line[:mpg.start()])

            qg    = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))

            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])

            if qglen > 0:
                mqgend = 0
                for j,mqg in enumerate(qg):
                    # text between the previous comma match (or the block start) and this one
                    outfile.write( mpg.group(0)[mqgend:mqg.start()] )

                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )
                    mqgend = mqg.end()

                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():]      )
            else:
                outfile.write(mpg.group(0))

            if i == (pglen-1):
                outfile.write(line[mpg.end():])

            mpgstart = mpg.start()
            mpgend   = mpg.end()
    else:
        outfile.write(line)

outfile.close()
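The bookkeeping with start() and end() can be avoided by letting re.sub rebuild the line itself. The following is a minimal Python 3 sketch of the callback approach from the earlier answer, not a drop-in replacement for the loop above. The names `QUOTED`, `BARE_COMMA`, `escape_line` and `escape_lines` are hypothetical, and it assumes quoted blocks contain no escaped or doubled quotes.

```python
import re

QUOTED = re.compile(r'"[^"]*"')       # one double-quoted block
BARE_COMMA = re.compile(r'(?<!\\),')  # a comma not already preceded by a backslash


def escape_line(line):
    """Escape unescaped commas inside every quoted block of `line`."""
    # re.sub calls the lambda for each quoted block and splices the result
    # back into the line, so text outside the quotes is left untouched.
    return QUOTED.sub(lambda m: BARE_COMMA.sub(r'\\,', m.group(0)), line)


def escape_lines(lines):
    """Apply escape_line to each line, e.g. the result of readlines()."""
    return [escape_line(line) for line in lines]


print(escape_line('"thing1,blah","thing2,blah","thing3\\,blah",thing4'))
```

Because the outer substitution handles reassembly, no recursion is needed: one pass over the quoted blocks and one pass over the commas inside each block.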

2 Comments

Your code is incredibly tortuous: you use regexes for only a tiny part of what they can do, just to extract pieces of the string that you then process with string methods, which is precisely the work regexes are meant for. By the way, you didn't notice that your code gives a wrong result for "thing3,blah", which is transformed into "thingx\x03,blah"; I don't know how.
Also, by the way, did you notice at any point that there were answers and discussion on the very question you posted? You don't make the slightest reference to the other answers, which we hoped would be helpful. Instead, you post code of no interest, as if you hadn't read the answers. I find this a little unfair.
0

Have you looked into str.replace()?

str.replace(old, new[, count]) Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

Here is some documentation.

Hope this helps.
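A short Python 3 illustration of what str.replace does on the question's kind of input (the sample line and variable names are made up). Note that str.replace on its own cannot tell commas inside quotes from the delimiters between fields, so it would need to be combined with something that isolates the quoted blocks first.

```python
line = '"thing1,blah","thing2,blah",thing4'

# Every comma is replaced, including the field separators outside the quotes.
all_escaped = line.replace(",", "\\,")
print(all_escaped)

# The optional count argument limits how many occurrences are replaced.
first_only = line.replace(",", "\\,", 1)
print(first_only)
```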

Comments
