
I have some text that I need to parse in Python.

It is a string that I would like to split into a list of lines. However, if a newline (\n) is inside quotes, it should be ignored.

For example:

abcd efgh ijk\n1234 567"qqqq\n---" 890\n

should be parsed into a list of the following lines:

abcd efgh ijk
1234 567"qqqq\n---" 890

I've tried to do it with split('\n'), but I don't know how to ignore the quotes.

Any ideas?

Thanks!

  • What if there's an odd number of quotes? e.g. foo"bar"oh"what? Commented Jun 3, 2014 at 15:06
  • The number of quotes is even Commented Jun 4, 2014 at 13:44

4 Answers

Here's a much easier solution.

Match groups of (?:"[^"]*"|.)+. Namely, "things in quotes or things that aren't newlines".

Example:

import re
re.findall('(?:"[^"]*"|.)+', text)
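
As a quick check, here is the same pattern applied to the sample string from the question. The names `sample` and `lines` are only illustrative.

```python
import re

# Sample input from the question: one newline sits inside double quotes.
sample = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'

# A quoted run may span newlines; outside quotes, "." stops at a newline.
lines = re.findall(r'(?:"[^"]*"|.)+', sample)

for line in lines:
    print(repr(line))
```

This should print the two lines the question asks for, with the quoted newline left in place.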

NOTE: This coalesces several newlines into one, as blank lines are ignored. To avoid that, give a null case as well: (?:"[^"]*"|.)+|(?!\Z).

The (?!\Z) is a confusing way to say "not the end of a string". The (?! ) is negative lookahead; the \Z is the "end of a string" part.
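
How `findall` reports empty matches has changed between Python versions, so it is worth checking any pattern that relies on them against your own interpreter. If blank lines must be kept, one alternative is to avoid empty matches altogether: tokenize the text and start a new line only on a newline token. This is a sketch of a different technique, not the pattern above, and the names `TOKEN` and `split_keep_blank_lines` are hypothetical.

```python
import re

# Tokens: a complete quoted run, a run of ordinary characters,
# a single newline, or a stray (unbalanced) quote.
TOKEN = re.compile(r'"[^"]*"|[^"\n]+|\n|"')


def split_keep_blank_lines(text):
    lines = [[]]  # each line is collected as a list of tokens
    for token in TOKEN.findall(text):
        if token == "\n":
            # Newlines inside quotes are part of a quoted-run token,
            # so a bare "\n" token is always outside quotes.
            lines.append([])
        else:
            lines[-1].append(token)
    return ["".join(parts) for parts in lines]


print(split_keep_blank_lines('a\n\nb "x\ny" c'))
```

A trailing newline in the input yields a final empty string, the same as `str.split('\n')` would.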


Tests:

import re

texts = (
    'text',
    '"text"',
    'text\ntext',
    '"text\ntext"',
    'text"text\ntext"text',
    'text"text\n"\ntext"text"',
    '"\n"\ntext"text"',
    '"\n"\n"\n"\n\n\n""\n"\n"'
)

line_matcher = re.compile('(?:"[^"]*"|.)+')

for text in texts:
    print("{:>27} → {}".format(
        text.replace("\n", "\\n"),
        " [LINE] ".join(line_matcher.findall(text)).replace("\n", "\\n")
    ))

#>>>                        text → text
#>>>                      "text" → "text"
#>>>                  text\ntext → text [LINE] text
#>>>                "text\ntext" → "text\ntext"
#>>>        text"text\ntext"text → text"text\ntext"text
#>>>    text"text\n"\ntext"text" → text"text\n" [LINE] text"text"
#>>>            "\n"\ntext"text" → "\n" [LINE] text"text"
#>>>    "\n"\n"\n"\n\n\n""\n"\n" → "\n" [LINE] "\n" [LINE] "" [LINE] "\n"

4 Comments

Thanks for your solution! really short and elegant!
Good job, compliments!
One of the most professional answers I have encountered so far! In case one wants to split at whitespaces (instead of newlines), one can use: (?:"[^"]*"|[^\s]+?)+
Curious: why does the regex need the non-matching group ((?:...)? Would it also work with a "normal" group?
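
On the non-capturing group: when a pattern contains a capturing group, `re.findall` returns the group's contents instead of the whole match, and a repeated group keeps only its last repetition. A small sketch of the difference (the variable names are illustrative):

```python
import re

text = 'ab\ncd'

# Non-capturing group: findall returns each whole match.
whole = re.findall(r'(?:"[^"]*"|.)+', text)

# Capturing group: findall returns group 1, which holds only the
# last repetition of the group (here, the last character of each line).
last_only = re.findall(r'("[^"]*"|.)+', text)

print(whole)      # ['ab', 'cd']
print(last_only)  # ['b', 'd']
```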

You can split the text on newlines, then use reduce to rejoin the elements that contain an odd number of " characters:

from functools import reduce  # reduce is not a builtin on Python 3

txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
s = txt.split('\n')
reduce(lambda x, y: x[:-1] + [x[-1] + '\n' + y] if x[-1].count('"') % 2 == 1 else x + [y], s[1:], [s[0]])
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

Explanation:

if x[-1].count('"') % 2 == 1
# If the last handled element contains an odd number of quotes
x[:-1] + [x[-1] + '\n' + y]
# Append y to that element, restoring the newline removed by split
else x + [y]
# Otherwise add y to the handled list as a new element

This can also be written as a plain loop:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    for item in s:
        if res and res[-1].count('"') % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

As pointed out by @Veedrac, recounting the quotes on every iteration makes this O(n^2); that can be avoided by keeping a running count of " characters:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    cnt = 0
    for item in s:
        if res and cnt % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
            cnt = 0
        cnt += item.count('"')
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

(The last empty string is because of the last \n at the end of the input string.)
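
If the repeated string concatenation is also a concern, one option is to collect the pieces of each logical line in a list and join them once. The sketch below is an assumption about how that could look, not code from the answer above; `split_with_quotes_joined` is a hypothetical name.

```python
def split_with_quotes_joined(txt):
    lines = []   # finished logical lines
    parts = []   # physical lines belonging to the logical line in progress
    quotes = 0   # running count of '"' in the logical line in progress
    for item in txt.split('\n'):
        parts.append(item)
        quotes += item.count('"')
        if quotes % 2 == 0:
            # Quotes are balanced, so the logical line is complete.
            lines.append('\n'.join(parts))
            parts = []
            quotes = 0
    if parts:
        # Unbalanced quote at the end of the input: keep what is left.
        lines.append('\n'.join(parts))
    return lines


txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
print(split_with_quotes_joined(txt))
```

Each physical line is appended and joined exactly once, so the work stays proportional to the length of the input.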

9 Comments

Eww... Also this is a quadratic time solution for an O(n) problem.
@Veedrac: it is quadratic indeed. It can easily be turned into linear by caching the count of " between iterations, though.
(although the string concatenation may still make it quadratic. researching on that)
@njzk2 In CPython there is an optimisation that could deal with that, but it's an implementation detail. You could in theory have a two-tier system (list of lists) and join the inner lists on the else, but it's a bit of a hassle.
@njzk2 I just tested it, and because it's in a list it's still O(n²). See the results here.

Ok, this seems to work (assuming quotes are properly balanced):

rx = r"""(?x)
    \n
    (?!
        [^"]*
        "
        (?=
            [^"]*
            (?:
                " [^"]* "
                [^"]*
            )*
            $
        )
    )
"""

Test:

str = """\
first
second "qqq
     qqq
     qqq
     " line
"third
    line" AND "spam
        ham" AND "more
            quotes"
end \
"""

import re


for x in re.split(rx, str):
    print('[%s]' % x)

Result:

[first]
[second "qqq
     qqq
     qqq
     " line]
["third
    line" AND "spam
        ham" AND "more
            quotes"]
[end ]

If the above looks too weird for you, you can also do this in two steps:

str = re.sub(r'"[^"]*"', lambda m: m.group(0).replace('\n', '\x01'), str)
lines = [x.replace('\x01', '\n') for x in str.splitlines()]

for line in lines:
    print('[%s]' % line)  # same result
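
The two-step idea can be wrapped in a function for reuse. This is a minimal sketch that assumes the placeholder character never occurs in the input; the names `PLACEHOLDER` and `split_lines_two_step` are hypothetical, and the explicit check is an addition for safety.

```python
import re

PLACEHOLDER = '\x01'  # assumed to be absent from real input


def split_lines_two_step(text):
    if PLACEHOLDER in text:
        raise ValueError('input already contains the placeholder character')
    # Step 1: hide newlines that sit inside a quoted run.
    masked = re.sub(r'"[^"]*"',
                    lambda m: m.group(0).replace('\n', PLACEHOLDER),
                    text)
    # Step 2: split on the remaining newlines, then restore the hidden ones.
    return [line.replace(PLACEHOLDER, '\n') for line in masked.splitlines()]


print(split_lines_two_step('abcd efgh ijk\n1234 567"qqqq\n---" 890\n'))
```

Note that `splitlines()` does not produce a trailing empty string for a final newline, which matches the output requested in the question.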


There are many ways to accomplish that. I came up with a very simple one:

import re

splitted = [""]
for i, x in enumerate(re.split('"', text)):
    if i % 2 == 0:
        lines = x.split('\n')
        splitted[-1] += lines[0]
        splitted.extend(lines[1:])
    else:
        splitted[-1] += '"{0}"'.format(x)
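
For reference, a sketch of the same loop wrapped in a function and run on the question's sample. The name `split_outside_quotes` and the `sample` variable are illustrative, and like the loop above it assumes the quotes are balanced.

```python
import re


def split_outside_quotes(text):
    result = [""]
    # Splitting on '"' alternates between unquoted (even index)
    # and quoted (odd index) segments when quotes are balanced.
    for i, segment in enumerate(re.split('"', text)):
        if i % 2 == 0:
            # Unquoted: newlines here are real line breaks.
            pieces = segment.split('\n')
            result[-1] += pieces[0]
            result.extend(pieces[1:])
        else:
            # Quoted: keep the segment intact and put the quotes back.
            result[-1] += '"{0}"'.format(segment)
    return result


sample = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
print(split_outside_quotes(sample))
```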
