Regex matching spaces, but not in "strings"

Question

I am looking for a regular exression matching spaces only if thos spaces are not enclosed in double quotes ("). For example, in

Mary had "a little lamb"

it should match the first an the second space, but not the others.

I want to split the string only at the spaces which are not in the double quotes, and not at the quotes.

I am using C++ with the Qt toolkit and wanted to use QString::split(QRegExp). QString is very similar to std::string and QRegExp are basically POSIX regex encapsulated in a class. If there exist such a regex, the split would be trivial.

Examples:

Mary had "a little lamb"     =>   Mary,had,"a little lamb"
1" 2 "3                      =>   1" 2 "3 (no splitting at ")
abc def="g h i" "j k" = 12   =>   abc,def="g h i","j k",=,12

Sorry for the edits, I was very imprecise when I asked the question first. Hope it is somewhat more clear now.

The question is answered here: Using regex to replace all spaces NOT in quotes in Ruby — Ron Gejman
– Ron Gejman, Commented Aug 21, 2009 at 7:45

Alan Moore · Accepted Answer · 2009-08-21 17:04:08Z

9

(I know you just posted almost exactly the same answer yourself, but I can't bear to just throw all this away. :-/)

If it's possible to solve your problem with a regex split operation, the regex will have to match even numbers of quotation marks, as MSalters said. However, a split regex should match only the spaces you're splitting on, so the rest of the work has to be done in a lookahead. Here's what I would use:

" +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

If the text is well formed, a lookahead for an even number of quotes is sufficient to determine that the just-matched space is not inside a quoted sequence. That is, lookbehinds aren't necessary, which is good because QRegExp doesn't seem to support them. Escaped quotes can be accommodated too, but the regex becomes quite a bit larger and uglier. But if you can't be sure the text is well formed, it's extremely unlikely you'll be able to solve your problem with split().

By the way, QRegExp does not implement POSIX regular expressions--if it did, it wouldn't support lookaheads OR lookbehinds. Instead, it falls into the loosely-defined category of Perl-compatible regex flavors.

edited Aug 21, 2009 at 17:04

answered Aug 21, 2009 at 15:52

Alan Moore

75.6k13 gold badges109 silver badges161 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Gunther Piez Over a year ago

Well, I wrote basically POSIX ;-) When I searched for lookbehinds, I noticed that there was something missing. But shortly after, I noticed I didn't even need them, assuming always paired quotes. Maybe I should file a bug at Qt. Anyway, +1 for the improvment of the regex.

Alan Moore Over a year ago

I realize you were using the term "POSIX" loosely; I was merely pointing out others might not. I had to look up QRegExp to be sure it wasn't POSIX-standard (or, more specifically, that I could recommend a lookahead-based solution).

MSalters · Accepted Answer · 2009-08-21 15:27:01Z

4

What should happen to "a" b "c" ?

Note that in the substring " b " the spaces are between quotes.

-- edit --

I assume a space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”).

That means you need the following regex: ^[^"]*("[^"]*"[^"]*)*"[^"]* [^"]*"[^"]*("[^"]*"[^"]*)*$

("[^"]*"[^"]*) represents a pair of quotes. ("[^"]*"[^"]*)* is an even amount of quotes, ("[^"]"[^"]*)*" an odd amount. Then there's the actual quoted string part, followed by another odd number of quotes. ^$ anchors are needed because you need to count every quote from the beginning of the string. This answers the " b " substring problem above by never looking at substrings. The price is that every character in your input must be matched against the entire string, which turns this into an O(N*N) split operation.

The reason why you can do this in a regex is because there is a finite amount of memory needed. Effectively just one bit; "have I seen an odd or even number of quotes so far?". You don't actually have to match up individual "" pairs.

This is not the only interpretation possible, though. If you do include “funny Unicode quotes” which should be paired, you also need to deal with ““double quoted”” strings. This in turn means you need a count of open “, which means you need infinite storage, which in turns means it's no longer a regular language, which means you can't use a regex. QED.

Anyway, even if it was possible, you still would want a proper parser. The O(N*N) behavior to count the number of quotes preceding each character just isn't funny. If you already know there are X quotes preceding Str[N], it should be an O(1) operation to determine how many quotes precede Str[N+1], not O(N). The possible answers are after all just X or X+1 !

edited Aug 21, 2009 at 15:27

answered Aug 21, 2009 at 7:49

MSalters

182k11 gold badges171 silver badges376 bronze badges

6 Comments

Gumbo Over a year ago

That’s a question and not an answer. Use a comment.

MSalters Over a year ago

It's an answer with a question mark ;) The problem is that he's using the wrong tool (regex instead of a stack-based parser) for his problem. And there's no "close reason: problem can't be solved with regex"

Gunther Piez Over a year ago

The reason I ask is because I wanted to avoid using a parser. I wanted the "cheap" solution. If there is no solution using regex, provide a mathematical proof and I will accept it as answer :-) Doesn't even need to be strict and rigor :-)

MSalters Over a year ago

Let's give it a shot. Def: A space is "between quotes" if it is preceded and followed by an odd number of standard quotation marks (i.e. U+0022, I'll ignore those funny Unicode “quotes”). That means you need the following regex: ^.*(".*".*)*".* .*".*(".*".*)*$ Hmm, I'd better turn that into an answer :P

Gunther Piez Over a year ago

Great idea of odd and even counting :-) But shouldn't there be a * after the second closing ] ? And did you try to match the spaces between the quotes? I did want the others. But anyway, thats a starting point :-)

|

Gunther Piez · Accepted Answer · 2009-08-21 15:47:05Z

MSalters pushed me on the right track. The problem with his answer that the regex he gives always matches the whole string and so is unsuitable for split(), but this can partly redeemed by a lookahead match. Assuming that the quotes are always paired (they are indeed), I can split on every space which is followed by an even number of quotes.

The regex without C escapes and in single quotes looks like

' (?=[^"]*("[^"]*"[^"]*)*$)'

In the source it finally looked like (using Qt and C++)

QString buf("Mary had \"a little lamb\""); // string we want to split
QStringList splitted = buf.split( QRegExp(" (?=[^\"]*(\"[^\"]*\"[^\"]*)*$)") );

Simple, eh?

For the performance, the strings are parsed once at the start of the program, they are a few dozen and they are less than hundred chars. I will test its runtime with long strings, just to be sure nothing bad happens ;-)

FMc · Accepted Answer · 2009-08-21 14:40:10Z

If the quoting in the strings is simple (like your examples), you can use alternation. This regex first hunts for a simple quoted string; failing that it finds spaces.

/(\"[^\"]*\"| +)/

In Perl, if you use grouping in the regex when calling split(), the function returns not only the elements but also the captured groups (in this case, our delimiter). If you then filter out the blank and spaces-only delimiters, you will get the desired list of elements. I don't know whether a similar strategy would work in C++, but the following Perl code does work:

use strict;
use warnings;
while (<DATA>){
    chomp;
    my @elements = split /(\"[^\"]*\"| +)/, $_;
    @elements = grep {length and /[^ ]/} @elements;
    # Do stuff with @elements
}

__DATA__
Mary had "a little lamb"
1" 2 "3
abc def="g h i" "j k" = 12

Alan Moore · Accepted Answer · 2009-08-22 02:02:56Z

-2

Simplest regex-solution: match whole spaces AND quotes. Filter quotes later

"[^"]*"|\s

edited Aug 22, 2009 at 2:02

Alan Moore

75.6k13 gold badges109 silver badges161 bronze badges

answered Aug 21, 2009 at 7:54

maykeye

1,3161 gold badge12 silver badges16 bronze badges

Collectives™ on Stack Overflow

Regex matching spaces, but not in "strings"

5 Answers 5

2 Comments

6 Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

2 Comments

6 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related