5

I am looking for a regular exression matching spaces only if thos spaces are not enclosed in double quotes ("). For example, in

Mary had "a little lamb"

it should match the first an the second space, but not the others.

I want to split the string only at the spaces which are not in the double quotes, and not at the quotes.

I am using C++ with the Qt toolkit and wanted to use QString::split(QRegExp). QString is very similar to std::string and QRegExp are basically POSIX regex encapsulated in a class. If there exist such a regex, the split would be trivial.

Examples:

Mary had "a little lamb"     =>   Mary,had,"a little lamb"
1" 2 "3                      =>   1" 2 "3 (no splitting at ")
abc def="g h i" "j k" = 12   =>   abc,def="g h i","j k",=,12

Sorry for the edits, I was very imprecise when I asked the question first. Hope it is somewhat more clear now.

1

5 Answers 5

9

(I know you just posted almost exactly the same answer yourself, but I can't bear to just throw all this away. :-/)

If it's possible to solve your problem with a regex split operation, the regex will have to match even numbers of quotation marks, as MSalters said. However, a split regex should match only the spaces you're splitting on, so the rest of the work has to be done in a lookahead. Here's what I would use:

" +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

If the text is well formed, a lookahead for an even number of quotes is sufficient to determine that the just-matched space is not inside a quoted sequence. That is, lookbehinds aren't necessary, which is good because QRegExp doesn't seem to support them. Escaped quotes can be accommodated too, but the regex becomes quite a bit larger and uglier. But if you can't be sure the text is well formed, it's extremely unlikely you'll be able to solve your problem with split().

By the way, QRegExp does not implement POSIX regular expressions--if it did, it wouldn't support lookaheads OR lookbehinds. Instead, it falls into the loosely-defined category of Perl-compatible regex flavors.

Sign up to request clarification or add additional context in comments.

2 Comments

Well, I wrote basically POSIX ;-) When I searched for lookbehinds, I noticed that there was something missing. But shortly after, I noticed I didn't even need them, assuming always paired quotes. Maybe I should file a bug at Qt. Anyway, +1 for the improvment of the regex.
I realize you were using the term "POSIX" loosely; I was merely pointing out others might not. I had to look up QRegExp to be sure it wasn't POSIX-standard (or, more specifically, that I could recommend a lookahead-based solution).
4

What should happen to "a" b "c" ?

Note that in the substring " b " the spaces are between quotes.

-- edit --

I assume a space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”).

That means you need the following regex: ^[^"]*("[^"]*"[^"]*)*"[^"]* [^"]*"[^"]*("[^"]*"[^"]*)*$

("[^"]*"[^"]*) represents a pair of quotes. ("[^"]*"[^"]*)* is an even amount of quotes, ("[^"]"[^"]*)*" an odd amount. Then there's the actual quoted string part, followed by another odd number of quotes. ^$ anchors are needed because you need to count every quote from the beginning of the string. This answers the " b " substring problem above by never looking at substrings. The price is that every character in your input must be matched against the entire string, which turns this into an O(N*N) split operation.

The reason why you can do this in a regex is because there is a finite amount of memory needed. Effectively just one bit; "have I seen an odd or even number of quotes so far?". You don't actually have to match up individual "" pairs.

This is not the only interpretation possible, though. If you do include “funny Unicode quotes” which should be paired, you also need to deal with ““double quoted”” strings. This in turn means you need a count of open , which means you need infinite storage, which in turns means it's no longer a regular language, which means you can't use a regex. QED.

Anyway, even if it was possible, you still would want a proper parser. The O(N*N) behavior to count the number of quotes preceding each character just isn't funny. If you already know there are X quotes preceding Str[N], it should be an O(1) operation to determine how many quotes precede Str[N+1], not O(N). The possible answers are after all just X or X+1 !

6 Comments

That’s a question and not an answer. Use a comment.
It's an answer with a question mark ;) The problem is that he's using the wrong tool (regex instead of a stack-based parser) for his problem. And there's no "close reason: problem can't be solved with regex"
The reason I ask is because I wanted to avoid using a parser. I wanted the "cheap" solution. If there is no solution using regex, provide a mathematical proof and I will accept it as answer :-) Doesn't even need to be strict and rigor :-)
Let's give it a shot. Def: A space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”). That means you need the following regex: ^.*(".*".*)*".* .*".*(".*".*)*$ Hmm, I'd better turn that into an answer :P
Great idea of odd and even counting :-) But shouldn't there be a * after the second closing ] ? And did you try to match the spaces between the quotes? I did want the others. But anyway, thats a starting point :-)
|
4

MSalters pushed me on the right track. The problem with his answer that the regex he gives always matches the whole string and so is unsuitable for split(), but this can partly redeemed by a lookahead match. Assuming that the quotes are always paired (they are indeed), I can split on every space which is followed by an even number of quotes.

The regex without C escapes and in single quotes looks like

' (?=[^"]*("[^"]*"[^"]*)*$)'

In the source it finally looked like (using Qt and C++)

QString buf("Mary had \"a little lamb\""); // string we want to split
QStringList splitted = buf.split( QRegExp(" (?=[^\"]*(\"[^\"]*\"[^\"]*)*$)") );

Simple, eh?

For the performance, the strings are parsed once at the start of the program, they are a few dozen and they are less than hundred chars. I will test its runtime with long strings, just to be sure nothing bad happens ;-)

Comments

1

If the quoting in the strings is simple (like your examples), you can use alternation. This regex first hunts for a simple quoted string; failing that it finds spaces.

/(\"[^\"]*\"| +)/

In Perl, if you use grouping in the regex when calling split(), the function returns not only the elements but also the captured groups (in this case, our delimiter). If you then filter out the blank and spaces-only delimiters, you will get the desired list of elements. I don't know whether a similar strategy would work in C++, but the following Perl code does work:

use strict;
use warnings;
while (<DATA>){
    chomp;
    my @elements = split /(\"[^\"]*\"| +)/, $_;
    @elements = grep {length and /[^ ]/} @elements;
    # Do stuff with @elements
}

__DATA__
Mary had "a little lamb"
1" 2 "3
abc def="g h i" "j k" = 12

Comments

-2

Simplest regex-solution: match whole spaces AND quotes. Filter quotes later

"[^"]*"|\s

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.