24

Hey, I'm trying to figure out a regular expression to do the following.

Here is my string:

Place,08/09/2010,"15,531","2,909",650

I need to split this string by the commas. However, because commas are also used inside the quoted numeric fields, a plain split doesn't work correctly. So I want to remove the commas in the numbers before splitting the string.

Thanks.

1
  • If none of the answers are fitting enough, could you point out what you're missing still? Commented Nov 17, 2011 at 19:35

6 Answers

49
new_string = re.sub(r'"(\d+),(\d+)"', r'\1.\2', original_string)

This will substitute the , inside each quoted number with a . (dropping the surrounding quotes), and you can now just use the string's split method.
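As a quick illustration, here is a minimal sketch of that substitution applied to the sample line from the question, followed by the split. The variable name fields is just for illustration.

```python
import re

original_string = 'Place,08/09/2010,"15,531","2,909",650'

# Replace the comma inside each quoted two-group number with a period;
# the quotes are part of the match, so they are removed as well.
new_string = re.sub(r'"(\d+),(\d+)"', r'\1.\2', original_string)

# With no commas left inside the numbers, a plain split is enough.
fields = new_string.split(',')

print(new_string)  # Place,08/09/2010,15.531,2.909,650
print(fields)      # ['Place', '08/09/2010', '15.531', '2.909', '650']
```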

8 Comments

...unless the number is "123,456,789" - then it will be "123.456,789"
good point, I considered only the case where the , was the radix point in a valid number.
Thanks, yours is the correct solution I was looking for. Cheers!
@Ciarán Careful to not have any number greater than 999,999 inside the quotes...
The expression offered does not match "999,999,999" at all - you would need something with lookaround, like: "(\d+),(?=\d\d\d(,\d\d\d)*") (which should be replaced with "\1)
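
One way to sidestep the multi-comma problem raised in these comments is to pass a function as the replacement, so every comma inside a quoted number is removed regardless of how many groups it has. This is only a sketch: strip_quoted_commas is a hypothetical helper name, and it assumes quoted fields contain nothing but digits and commas.

```python
import re

def strip_quoted_commas(line):
    # Match a double-quoted run of digits and commas, e.g. "548,122,909",
    # and replace it with the digits only (quotes and commas removed).
    return re.sub(r'"([\d,]+)"', lambda m: m.group(1).replace(',', ''), line)

line = 'Place,08/09/2010,"15,531","548,122,909",650'
fields = strip_quoted_commas(line).split(',')
print(fields)  # ['Place', '08/09/2010', '15531', '548122909', '650']
```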
26
>>> from StringIO import StringIO
>>> import csv
>>> r = csv.reader(StringIO('Place,08/09/2010,"15,531","2,909",650'))
>>> r.next()
['Place', '08/09/2010', '15,531', '2,909', '650']

2 Comments

Though actually I now get this error: Traceback (most recent call last): , line 19, in <module> from StringIO import StringIO ImportError: No module named StringIO
In Python 3 it's io.StringIO: docs.python.org/py3k/library/…
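
Following that comment, a Python 3 version of the same csv approach might look like the sketch below. It uses io.StringIO and the built-in next() in place of r.next(); the final conversion of the numeric fields to int is an added illustration and was not part of the answer.

```python
import csv
from io import StringIO

line = 'Place,08/09/2010,"15,531","2,909",650'

# csv.reader understands double-quoted fields, so embedded commas survive.
reader = csv.reader(StringIO(line))
row = next(reader)
print(row)  # ['Place', '08/09/2010', '15,531', '2,909', '650']

# Optional: drop the thousands separators and convert to integers.
numbers = [int(field.replace(',', '')) for field in row[2:]]
print(numbers)  # [15531, 2909, 650]
```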
1

Another way of doing it using regex directly:

>>> import re
>>> data = "Place,08/09/2010,\"15,531\",\"2,909\",650"
>>> res = re.findall(r"(\w+),(\d{2}/\d{2}/\d{4}),\"([\d,]+)\",\"([\d,]+)\",(\d+)", data)
>>> res
[('Place', '08/09/2010', '15,531', '2,909', '650')]
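
The pattern above is tied to this exact five-field layout. A less format-specific variant, sketched here under the assumptions that there are no empty fields and no escaped quotes inside quoted fields, matches either a whole quoted field or a run of non-comma characters:

```python
import re

data = 'Place,08/09/2010,"15,531","2,909",650'

# Alternative 1: a complete double-quoted field (may contain commas).
# Alternative 2: a run of characters up to the next comma.
raw_fields = re.findall(r'"[^"]*"|[^,]+', data)

# Remove the surrounding quotes from the quoted fields.
fields = [f.strip('"') for f in raw_fields]
print(fields)  # ['Place', '08/09/2010', '15,531', '2,909', '650']
```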

Comments

1

You could parse a string of that format using pyparsing:

import pyparsing as pp
import datetime as dt

st='Place,08/09/2010,"15,531","2,909",650'

def line_grammar():
    integer=pp.Word(pp.nums).setParseAction(lambda s,l,t: [int(t[0])])
    sep=pp.Suppress('/')
    date=(integer+sep+integer+sep+integer).setParseAction(
              lambda s,l,t: dt.date(t[2],t[1],t[0]))
    comma=pp.Suppress(',')
    quoted=pp.Regex(r'("|\').*?\1').setParseAction(
              lambda s,l,t: [int(e) for e in t[0].strip('\'"').split(',')])
    line=pp.Word(pp.alphas)+comma+date+comma+quoted+comma+quoted+comma+integer
    return line

line=line_grammar()
print(line.parseString(st))
# ['Place', datetime.date(2010, 9, 8), 15, 531, 2, 909, 650]

The advantage is that you parse, convert, and validate in a few lines. Note that the numeric fields are all converted to ints (each quoted field is split at its comma, so "15,531" yields 15 and 531) and the date to a datetime.date object.
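
For comparison, a similar parse-and-convert step can be sketched with only the standard library. This is not the pyparsing grammar above: parse_line is a hypothetical helper, the date is assumed to be day-first (matching dt.date(t[2], t[1], t[0]) above), and each quoted number is joined into a single int rather than split at the comma.

```python
import csv
import datetime as dt
from io import StringIO

def parse_line(line):
    # Let the csv module handle the quoted fields.
    place, date_text, *numbers = next(csv.reader(StringIO(line)))
    # strptime raises ValueError on a malformed date, which acts as validation.
    date = dt.datetime.strptime(date_text, '%d/%m/%Y').date()
    # int() raises ValueError if a field is not numeric after removing commas.
    return [place, date] + [int(n.replace(',', '')) for n in numbers]

print(parse_line('Place,08/09/2010,"15,531","2,909",650'))
# ['Place', datetime.date(2010, 9, 8), 15531, 2909, 650]
```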

Comments

0
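a = """Place,08/09/2010,"15,531","2,909",650""".split(',')
result = []
i = 0
while i < len(a):
    if '"' not in a[i]:
        # Unquoted piece: it is already a complete field.
        result.append(a[i])
    else:
        # Opening piece of a quoted field that the split broke apart:
        # keep appending pieces until the one holding the closing quote.
        string = a[i]
        i += 1
        while True:
            string += "," + a[i]
            if '"' in a[i]:
                break
            i += 1
        result.append(string)
    i += 1
print(result)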

Result:
['Place', '08/09/2010', '"15,531"', '"2,909"', '650']
I'm not a big fan of regular expressions unless you absolutely need them.
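
A related regex-free sketch walks the line character by character and only splits on commas that fall outside double quotes. split_outside_quotes is a hypothetical helper name, and the sketch assumes quotes are balanced and never escaped. Unlike the loop above, it drops the quote characters from the result.

```python
def split_outside_quotes(line):
    fields = []        # completed fields
    current = []       # characters of the field being built
    in_quotes = False  # True while between an opening and closing quote
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle state, do not keep the quote
        elif ch == ',' and not in_quotes:
            fields.append(''.join(current))  # field boundary
            current = []
        else:
            current.append(ch)
    fields.append(''.join(current))  # last field has no trailing comma
    return fields

print(split_outside_quotes('Place,08/09/2010,"15,531","2,909",650'))
# ['Place', '08/09/2010', '15,531', '2,909', '650']
```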

Comments

-1

If you need a regex solution, this should do:

r"(\d+),(?=\d\d\d)"

then replace with:

"\1"

It will replace any comma-delimited numbers anywhere in your string with their number-only equivalent, thus turning this:

Place,08/09/2010,"15,531","548,122,909",650

into this:

Place,08/09/2010,"15531","548122909",650

I'm sure there are a few holes to be found and places you don't want this done, and that's why you should use a parser!

Good luck!

2 Comments

Does the downvoter dare to tell me why my solution doesn't answer the question?
The question is very unclear, but I think it was really asking how to ignore commas in double quotes (that is, how to parse a line of a csv file) and the example given was just one example of something that got parsed wrong by naive comma-splitting. The regexp fails on cases like '1000,1234' as well as ignoring double quotes. The downvoter wasn't me though, so consider this just my guess as to his or her motivation. The poster who suggested using the csv module clearly gave the right answer I think.
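
The failure case mentioned in that comment is easy to reproduce. The short sketch below applies the lookahead pattern from this answer to both the answer's example and to two unquoted fields, 1000 and 1234, which end up merged because the pattern does not look at the quotes at all:

```python
import re

pattern = re.compile(r'(\d+),(?=\d\d\d)')

# The answer's own example: commas inside the quoted numbers are removed.
cleaned = pattern.sub(r'\1', 'Place,08/09/2010,"15,531","548,122,909",650')
print(cleaned)  # Place,08/09/2010,"15531","548122909",650

# Two separate unquoted fields: the field separator is removed too.
merged = pattern.sub(r'\1', '1000,1234')
print(merged)  # 10001234
```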
