Parsing a string in Python: how to split on newlines while ignoring newlines inside quotes

Question

I have some text that I need to parse in Python.

It is a string that I would like to split into a list of lines; however, a newline (\n) that is inside quotes should be ignored.

For example:

abcd efgh ijk\n1234 567"qqqq\n---" 890\n

should be parsed into a list of the following lines:

abcd efgh ijk
1234 567"qqqq\n---" 890

I've tried to do it with split('\n'), but I don't know how to ignore the quotes.

Any ideas?

Thanks!

What if there's an odd number of quotes, e.g. foo"bar"oh"what? – Pavel, commented Jun 3, 2014 at 15:06

georg · Accepted Answer · 2014-06-04 13:01:41Z

8

Here's a much easier solution.

Match groups of (?:"[^"]*"|.)+. Namely, "things in quotes or things that aren't newlines".

Example:

import re
re.findall('(?:"[^"]*"|.)+', text)
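
As a quick check, here is the same pattern applied to the input from the question, plus a string with an odd number of quotes (the case raised in the comment on the question). This is only a sketch; the names line_matcher, lines and odd are chosen for illustration.

```python
import re

# A quoted run, or any single non-newline character, repeated
# (the pattern from this answer).
line_matcher = re.compile(r'(?:"[^"]*"|.)+')

text = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
lines = line_matcher.findall(text)
print(lines)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890']

# Odd number of quotes: the unmatched quote cannot start a quoted run,
# so it falls through to "." and is treated as an ordinary character.
odd = line_matcher.findall('foo"bar"oh"what\nnext')
print(odd)
# ['foo"bar"oh"what', 'next']
```

Note that the trailing newline of the input does not produce an empty last element here, unlike split('\n').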

NOTE: This coalesces several newlines into one, as blank lines are ignored. To avoid that, give a null case as well: (?:"[^"]*"|.)+|(?!\Z).

The (?!\Z) is a confusing way to say "not the end of a string". The (?! ) is negative lookahead; the \Z is the "end of a string" part.

Tests:

import re

texts = (
    'text',
    '"text"',
    'text\ntext',
    '"text\ntext"',
    'text"text\ntext"text',
    'text"text\n"\ntext"text"',
    '"\n"\ntext"text"',
    '"\n"\n"\n"\n\n\n""\n"\n"'
)

line_matcher = re.compile('(?:"[^"]*"|.)+')

for text in texts:
    print("{:>27} → {}".format(
        text.replace("\n", "\\n"),
        " [LINE] ".join(line_matcher.findall(text)).replace("\n", "\\n")
    ))

#>>>                        text → text
#>>>                      "text" → "text"
#>>>                  text\ntext → text [LINE] text
#>>>                "text\ntext" → "text\ntext"
#>>>        text"text\ntext"text → text"text\ntext"text
#>>>    text"text\n"\ntext"text" → text"text\n" [LINE] text"text"
#>>>            "\n"\ntext"text" → "\n" [LINE] text"text"
#>>>    "\n"\n"\n"\n\n\n""\n"\n" → "\n" [LINE] "\n" [LINE] "" [LINE] "\n"

edited Jun 4, 2014 at 13:01

georg

answered Jun 4, 2014 at 12:22

Veedrac

4 Comments

Yuval Atzmon Over a year ago

Thanks for your solution! really short and elegant!

georg Over a year ago

Good job, compliments!

jakob.j Over a year ago

One of the most professional answers I have encountered so far! In case one wants to split at whitespaces (instead of newlines), one can use: (?:"[^"]*"|[^\s]+?)+

jakob.j Over a year ago

Curious: why does the regex need the non-capturing group (?:...)? Would it also work with a "normal" group?
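
On the question in the last comment: a short sketch of what changes with a capturing group. When the pattern contains a capturing group, re.findall returns the group's content rather than the whole match, and a repeated group only keeps its last repetition. The names whole and last are illustrative.

```python
import re

text = 'ab\ncd'

# Non-capturing group: findall returns each whole match.
whole = re.findall(r'(?:"[^"]*"|.)+', text)
print(whole)   # ['ab', 'cd']

# Capturing group: findall returns the group's content instead,
# and a repeated group only keeps its last repetition.
last = re.findall(r'("[^"]*"|.)+', text)
print(last)    # ['b', 'd']
```

So a "normal" group still matches the same spans, but findall no longer hands back the full lines.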

njzk2 · Accepted Answer · 2014-06-04 13:15:53Z

4

You can split it, then use reduce to rejoin the elements that contain an odd number of ":

from functools import reduce  # reduce is no longer a builtin in Python 3

txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
s = txt.split('\n')
reduce(lambda x, y: x[:-1] + [x[-1] + '\n' + y] if x[-1].count('"') % 2 == 1 else x + [y], s[1:], [s[0]])
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

Explanation:

if x[-1].count('"') % 2 == 1
# If the last handled element contains an odd number of quotes (a quote is still open)
x[:-1] + [x[-1] + '\n' + y]
# Join y onto that element, restoring the newline that split() removed
else x + [y]
# Otherwise append y to the handled list as a new element

Can also be written like so:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    for item in s:
        if res and res[-1].count('"') % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

As pointed out by @Veedrac, this is O(n^2), but this can be prevented by keeping track of the count of ":

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    cnt = 0
    for item in s:
        if res and cnt % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
            cnt = 0
        cnt += item.count('"')
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

(The last empty string is because of the last \n at the end of the input string.)
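
The comments below point out that repeatedly concatenating onto res[-1] can still be quadratic. One way around that, roughly along the lines of the "two-tier" idea suggested there, is to collect the physical lines of the current logical line in a list and join them once. This is a sketch only; split_with_quotes_linear and its local names are hypothetical, not part of the original answer.

```python
def split_with_quotes_linear(txt):
    """Split on newlines, ignoring newlines inside double quotes."""
    lines = []   # finished logical lines
    parts = []   # physical lines of the logical line being built
    quotes = 0   # number of quotes seen in the current logical line
    for item in txt.split('\n'):
        parts.append(item)
        quotes += item.count('"')
        if quotes % 2 == 0:
            # All quotes are closed: the logical line is complete.
            lines.append('\n'.join(parts))
            parts = []
            quotes = 0
    if parts:
        # Input ended inside an unclosed quote: keep what was collected.
        lines.append('\n'.join(parts))
    return lines


txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
result = split_with_quotes_linear(txt)
print(result)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
```

Each physical line is appended once and joined once, so the work is proportional to the length of the input.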

edited Jun 4, 2014 at 13:15

answered Jun 3, 2014 at 15:19

njzk2

9 Comments

Veedrac Over a year ago

Eww... Also this is a quadratic time solution for an O(n) problem.

njzk2 Over a year ago

@Veedrac: it is quadratic indeed. It can easily be turned into linear by caching the count of " between iterations, though.

njzk2 Over a year ago

(although the string concatenation may still make it quadratic. researching on that)

Veedrac Over a year ago

@njzk2 In CPython there is an optimisation that could deal with that, but it's an implementation detail. You could in theory have a two-tier system (list of lists) and join the inner lists on the else, but it's a bit of a hassle.

Veedrac Over a year ago

@njzk2 I just tested it, and because it's in a list it's still O(n²). See the results here.

georg · Accepted Answer · 2014-06-03 16:03:52Z

Ok, this seems to work (assuming quotes are properly balanced):

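rx = r"""(?x)
    \n                      # split on a newline...
    (?!                     # ...unless it is followed by:
        [^"]*               #   any run of non-quote characters,
        "                   #   one quote (closing the currently open string),
        (?=                 #   and then, up to the end of the input,
            [^"]*
            (?:             #   only balanced pairs of quotes
                " [^"]* "
                [^"]*
            )*
            $
        )
    )
"""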

Test:

str = """\
first
second "qqq
     qqq
     qqq
     " line
"third
    line" AND "spam
        ham" AND "more
            quotes"
end \
"""

import re


for x in re.split(rx, str):
    print('[%s]' % x)

Result:

[first]
[second "qqq
     qqq
     qqq
     " line]
["third
    line" AND "spam
        ham" AND "more
            quotes"]
[end ]

If the above looks too weird for you, you can also do this in two steps:

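# Step 1: inside each quoted run, hide newlines behind a placeholder character.
str = re.sub(r'"[^"]*"', lambda m: m.group(0).replace('\n', '\x01'), str)
# Step 2: split on the remaining newlines, then restore the hidden ones.
lines = [x.replace('\x01', '\n') for x in str.splitlines()]

for line in lines:
    print('[%s]' % line)  # same result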
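
The two-step version can be wrapped in a small function and run on the input from the question. This is a sketch under two assumptions: quotes are balanced, and the placeholder character ('\x01' by default) never occurs in the input. The function name split_outside_quotes and the placeholder parameter are hypothetical.

```python
import re


def split_outside_quotes(text, placeholder='\x01'):
    """Split text into lines, keeping newlines that sit inside double quotes.

    Assumes the placeholder character does not appear in text.
    """
    # Hide newlines inside quoted runs so splitlines() cannot see them.
    masked = re.sub(
        r'"[^"]*"',
        lambda m: m.group(0).replace('\n', placeholder),
        text,
    )
    # Split on the remaining newlines and restore the hidden ones.
    return [line.replace(placeholder, '\n') for line in masked.splitlines()]


sample = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
result = split_outside_quotes(sample)
print(result)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890']
```

Because it uses splitlines(), a trailing newline does not produce an empty last element.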

igortg · Accepted Answer · 2014-06-03 17:49:38Z

1

There are many ways to accomplish that. I came up with a very simple one:

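import re

text = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'  # sample input from the question
splitted = [""]
# Splitting on quotes leaves unquoted chunks at even indexes, quoted chunks at odd ones.
for i, x in enumerate(re.split('"', text)):
    if i % 2 == 0:
        # Outside quotes: newlines are real line breaks.
        lines = x.split('\n')
        splitted[-1] += lines[0]
        splitted.extend(lines[1:])
    else:
        # Inside quotes: keep the chunk, newlines included, on the current line.
        splitted[-1] += '"{0}"'.format(x)
# splitted == ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']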

answered Jun 3, 2014 at 17:49

igortg
