I need to split a string on spaces or commas, but single- or double-quoted substrings must be left exactly as they are. It makes no difference whether items are separated by a single space or by many. For example:

    """ 1,' unchanged 1' " unchanged  2 "   2.009,-2e15 """

should return

    """ 1,' unchanged 1'," unchanged  2 ",2.009,-2e15 """

There may be zero or more spaces before and after a comma; those spaces are to be ignored. In this particular context, as shown in the example string, if two single- or double-quoted strings are next to each other, they will always be separated by a space or a comma.

I have a previous question, "python reg ex to include missing commas"; however, for that solution to work, a splitting comma has to be followed by a space.

  • So things in quotes, like unchanged 1 and unchanged 2, will always have a comma before and after them? nvm I see, either a space or a comma. Commented Jun 26, 2015 at 21:01
  • It should be one or more spaces, or a comma. If it is a comma, there may be zero or more spaces before and after it. Two items are never directly next to each other without a comma or a space in between. Commented Jun 26, 2015 at 21:49
  • In other words, you want to replace each group of "separator" spaces with a comma? That's how I understood your output string. And also, are "inner strings" always inside quotes or just if they contain spaces? Commented Jun 27, 2015 at 2:58
  • yes NZP. Inner strings are always quoted (double or single). Commented Jun 27, 2015 at 15:21
  • Great, then the one in my answer will work, if there are no additional gotchas in the string. One additional thing that occurred to me: can decimal numbers start with a dot, as in ".134" and omit the leading zero? Commented Jun 27, 2015 at 15:32

1 Answer

Edit: previous versions clobbered the newline that would, I assume, be in the file. Fixed now.

This is probably too much on the "if in doubt, use brute force" side, but it works:

    regex = r"""(?<=["'])[^\S\n]+(?=["'])|(?<=["'])[^\S\n]+(?=\d)|(?<=\d)[^\S\n]+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)[^\S\n]+(?=["'])|(?<=["'\d])[^\S\n]*,[^\S\n]*"""

It leaves commas inside strings, and handles numbers with a leading dot.

To get the output you want:

    re.sub(regex, ",", original_string)
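As a quick check of the claims above, here is a minimal, self-contained sketch that runs the substitution on the example string from the question and on a fragment with commas inside quotes and leading-dot numbers. The variable names `question_input` and `tricky` are only illustrative, and the `tricky` fragment is taken from the benchmark string further down.

```python
import re

# Same pattern as above; [^\S\n] matches any whitespace except a newline.
regex = r"""(?<=["'])[^\S\n]+(?=["'])|(?<=["'])[^\S\n]+(?=\d)|(?<=\d)[^\S\n]+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)[^\S\n]+(?=["'])|(?<=["'\d])[^\S\n]*,[^\S\n]*"""

# Example string from the question (leading and trailing space included).
question_input = """ 1,' unchanged 1' " unchanged  2 "   2.009,-2e15 """
question_output = re.sub(regex, ",", question_input)
print(repr(question_output))
# ' 1,\' unchanged 1\'," unchanged  2 ",2.009,-2e15 '

# Commas inside a quoted string stay put; numbers with a leading dot are split.
tricky = '''" fasf ,  , asfa" "2 fs", .085     .835'''
tricky_output = re.sub(regex, ",", tricky)
print(tricky_output)
# " fasf ,  , asfa","2 fs",.085,.835
```

Whitespace inside the quoted strings is untouched in both cases; only the separators between items are normalized to a single comma.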

For a rough idea of performance [1], on an Ivy Bridge Celeron:

    import timeit

    # The timed statement is passed to timeit as source code, so it carries
    # its own import and compiles the pattern on every run.
    # Note: the pattern here uses \s rather than [^\S\n], so it also matches newlines.
    stmt = """\
    import re

    s = \"\"\"1,' unchanged 1' " unchanged  2 "   2.009,-2e15 35  "  fad!" '   dfgsdfg ' ,   'asdfasdf'  " fasf ,  , asfa" "2 fs", .085     .835\"\"\"
    rgex = re.compile(r\"\"\"(?<=["'])\s+(?=["'])|(?<=["'])\s+(?=\d)|(?<=\d)\s+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)\s+(?=["'])|(?<=["'\d])\s*,\s*\"\"\")

    re.sub(rgex, ",", s)

    """

    # Run the same statement with an increasing number of iterations.
    print("1k iterations: ", timeit.timeit(stmt=stmt, number=1000))
    print("10k iterations: ", timeit.timeit(stmt=stmt, number=10000))
    print("100k iterations: ", timeit.timeit(stmt=stmt, number=100000))
    print("200k iterations: ", timeit.timeit(stmt=stmt, number=200000))
    print("300k iterations: ", timeit.timeit(stmt=stmt, number=300000))

gives:

    1k iterations:  0.0494868220000626
    10k iterations:  0.4617418729999372
    100k iterations:  4.604098313999884
    200k iterations:  9.197777003000056
    300k iterations:  13.79744054799994

Interestingly, the third-party regex module, which as far as I understand is supposed to be more performant and is meant to replace the standard library re some time in the future, is roughly two times slower here.

[1]: It's not a realistic test, as it just runs on the same string over and over, but I was in a hurry. Later I tried a slightly better test, with a single string consisting of 200,000 and 300,000 lines (of the same string), and it came out roughly the same: ~8 seconds for 200,000 lines and ~12 seconds for 300,000.
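For anyone who wants to repeat the multi-line test, here is a rough sketch under my own assumptions: the line is the question's example (stripped), `N_LINES` and the repeat count are arbitrary small values rather than the sizes quoted above, and all names are made up for illustration. It also shows why the pattern was changed from `\s` to `[^\S\n]`: with plain `\s`, the newline between a trailing digit and the next line's leading digit is treated as a separator.

```python
import re
import timeit

# Current pattern; [^\S\n] is "whitespace, but not a newline".
NEWLINE_SAFE = r"""(?<=["'])[^\S\n]+(?=["'])|(?<=["'])[^\S\n]+(?=\d)|(?<=\d)[^\S\n]+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)[^\S\n]+(?=["'])|(?<=["'\d])[^\S\n]*,[^\S\n]*"""
# The \s form, as used in the timed statement, derived by text substitution.
PLAIN_WS = NEWLINE_SAFE.replace(r"[^\S\n]", r"\s")

safe_re = re.compile(NEWLINE_SAFE)
plain_re = re.compile(PLAIN_WS)

LINE = """1,' unchanged 1' " unchanged  2 "   2.009,-2e15"""
N_LINES = 1000  # arbitrary; scale up to approximate the 200,000-line test
text = "\n".join([LINE] * N_LINES)

safe_out = safe_re.sub(",", text)
plain_out = plain_re.sub(",", text)

print(safe_out.count("\n"))   # 999: every line break survives
print(plain_out.count("\n"))  # 0: line breaks were replaced by commas

# Time only the substitution; the pattern is compiled once, outside the timer.
elapsed = timeit.timeit(lambda: safe_re.sub(",", text), number=10)
print(f"{N_LINES} lines, 10 runs: {elapsed:.4f} s")
```

Timings will of course depend on the machine, so I have not included any expected numbers here.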
