Split a string by spaces -- preserving quoted substrings -- in Python

Question

I have a string which is like this:

this is "a test"

I'm trying to write something in Python to split it up by space while ignoring spaces within quotes. The result I'm looking for is:

['this', 'is', 'a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.

Jerub · Accepted Answer · 2022-04-05 08:41:58Z

540

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

If you want to preserve the quotation marks, then you can pass the posix=False kwarg.

>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']
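As a minimal sketch of both modes in one place, the wrapper below switches between them with a flag. The names split_keep_quoted and keep_quotes are made up for this illustration and are not part of shlex. It also shows that an unbalanced quote raises ValueError instead of returning a partial result.

```python
import shlex


def split_keep_quoted(text, keep_quotes=False):
    """Split on whitespace, treating a quoted run as a single token.

    Hypothetical helper for illustration; only shlex.split is real API.
    """
    # posix=True (the default) strips the quotes, posix=False keeps them.
    return shlex.split(text, posix=not keep_quotes)


print(split_keep_quoted('this is "a test"'))                    # ['this', 'is', 'a test']
print(split_keep_quoted('this is "a test"', keep_quotes=True))  # ['this', 'is', '"a test"']

# An unbalanced quote raises ValueError rather than returning a partial list.
try:
    split_keep_quoted('this is "a test')
except ValueError as exc:
    print("could not split:", exc)
```

If the input comes from users, catching ValueError as above is probably worth it, since a stray quote is an easy mistake to make.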

edited Apr 5, 2022 at 8:41

answered Sep 17, 2008 at 4:27

Jerub

42.8k15 gold badges76 silver badges91 bronze badges


1 Comment

drootang Over a year ago

This is the simplest method that directly answers OP's question. If you need support for nested strings using escaped characters and/or multiple quote types, see the answer from @user261478

Pavel Štěrba · Accepted Answer · 2016-11-01 14:37:43Z

77

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

edited Nov 1, 2016 at 14:37

Pavel Štěrba

2,9222 gold badges31 silver badges51 bronze badges

answered Sep 17, 2008 at 4:27

Allen

5,11625 silver badges30 bronze badges

1 Comment

xaviersjs Over a year ago

Wow, impressive. You posted at the exact same time as @Jerub. And 2 minutes after the question!

Kate · Accepted Answer · 2010-03-16 21:46:35Z

47

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything (.*? is the non-greedy form, which stops at the first closing quote)
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.
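Several comments below ask about raw strings. As a sketch, the same pattern can be written as a raw string and compiled once. The names TOKEN_OR_SPACE and split_quoted are invented for this example. Note that this approach keeps the quote characters on the quoted tokens.

```python
import re

# Same alternation as above: a space, a double-quoted run, or a single-quoted run.
# The raw triple-quoted literal avoids the backslash-escaped quotes.
TOKEN_OR_SPACE = re.compile(r"""( |".*?"|'.*?')""")


def split_quoted(text):
    # re.split returns the captured separators too, so the quoted runs survive
    # as list items; whitespace-only and empty pieces are filtered out.
    return [p for p in TOKEN_OR_SPACE.split(text) if p.strip()]


print(split_quoted('this is "a test"'))  # ['this', 'is', '"a test"']
print(split_quoted("this is 'a test'"))  # ['this', 'is', "'a test'"]
```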

edited Mar 16, 2010 at 21:46

answered Feb 7, 2009 at 23:17

Kate

17 Comments

Darius Bacon Over a year ago

I was thinking much the same, but would suggest instead [t.strip('"') for t in re.findall(r'[^\s"]+|"[^"]*"', 'this is "a test"')]

hanleyp Over a year ago

+1 I'm using this because it was a heck of a lot faster than shlex.

SpliFF Over a year ago

that code almost looks like perl, haven't you heard of r"raw strings"?

Doppelganger Over a year ago

Why the triple backslash ? won't a simple backslash do the same ?

asmeurer Over a year ago

You should use raw strings when using regular expressions.


Ryan Ginstrom · Accepted Answer · 2019-10-31 23:46:43Z

34

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
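For a single string rather than a list of lines, a small wrapper is enough. This is a sketch; csv_split and its parameter defaults are choices made for this example. One behaviour to be aware of: every delimiter ends a field, so two spaces in a row produce an empty string in the result.

```python
import csv


def csv_split(line, delimiter=" ", quotechar='"'):
    # csv.reader expects an iterable of lines, so wrap the string in a list
    # and take the first (and only) row.
    return next(csv.reader([line], delimiter=delimiter, quotechar=quotechar))


print(csv_split('this is "a string"'))                  # ['this', 'is', 'a string']
print(csv_split("this is 'a string'", quotechar="'"))   # ['this', 'is', 'a string']
print(csv_split("a  b"))                                # ['a', '', 'b']
```

Only one quote character can be active per reader, so mixed single and double quoting needs a different approach.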

edited Oct 31, 2019 at 23:46

user3064538

answered Feb 8, 2009 at 2:18

Ryan Ginstrom

14.2k5 gold badges49 silver badges60 bronze badges

4 Comments

scraplesh Over a year ago

useful, when shlex strips some needed characters

user3064538 Over a year ago

CSVs use two double quotes in a row (side by side, "") to represent one double quote ", so two consecutive double quotes are turned into a single one: 'this is "a string""' and 'this is "a string"""' will both map to ['this', 'is', 'a string"']

Vinod Over a year ago

If the delimiter is something other than a space, shlex adds the delimiter to the individual strings.

Domenico Spidy Tamburro Over a year ago

useful, I had the case of the comma as the thousands separator, like ['UK', 'London', '1,234,567.89'], then using for row in csv.reader(lines, delimiter=",") interprets the records correctly

Daniel Dai · Accepted Answer · 2014-04-18 13:29:10Z

18

I used shlex.split to process 70,000,000 lines of Squid logs, and it was very slow, so I switched to re.

Try this if you have performance problems with shlex.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
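When the same pattern is applied to millions of lines, it can be compiled once up front. The sketch below does that; _TOKEN and split_lines are names invented here. Unlike shlex.split in its default mode, this pattern keeps the surrounding quotes on quoted tokens.

```python
import re

# Compiled once at import time: a bare word, or a double-quoted run.
_TOKEN = re.compile(r'[^"\s]\S*|".+?"')


def line_split(line):
    return _TOKEN.findall(line)


def split_lines(lines):
    # Generator, so a large log file can be streamed instead of loaded whole.
    for line in lines:
        yield line_split(line)


print(line_split('this is "a test"'))  # ['this', 'is', '"a test"']
```

split_lines accepts any iterable of strings, for example an open file object.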

answered Apr 18, 2014 at 13:29

Daniel Dai

1,0891 gold badge12 silver badges26 bronze badges


hochl · Accepted Answer · 2018-11-08 20:34:06Z

13

It seems that re is the faster option. Here is my solution using a non-greedy operator that preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

It leaves constructs like aaa"bla blub"bbb together as these tokens are not separated by spaces. If the string contains escaped characters, you can match like that:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.

edited Nov 8, 2018 at 20:34

answered Nov 8, 2018 at 15:21

hochl

13.1k10 gold badges58 silver badges92 bronze badges

1 Comment

a_guest Over a year ago

Another important advantage of this solution is its versatility with respect to the delimiting character (e.g. , via '(?:".*?"|[^,])+'). The same applies to the quoting (enclosing) character(s).
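The delimiter variation mentioned in the comment above could look like the following sketch. The function name split_outside_quotes is made up, and it assumes a single-character delimiter and double quotes as the only quoting character.

```python
import re


def split_outside_quotes(text, delimiter=","):
    # Assumes a single-character delimiter. re.escape keeps characters such
    # as "|" or "]" from being misread inside the character class.
    pattern = r'(?:".*?"|[^%s])+' % re.escape(delimiter)
    return re.findall(pattern, text)


print(split_outside_quotes('a,"b,c",d'))     # ['a', '"b,c"', 'd']
print(split_outside_quotes('x|"y|z"', "|"))  # ['x', '"y|z"']
```

As with the original pattern, the quotes stay on the tokens, and empty fields between two adjacent delimiters are skipped rather than returned as empty strings.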

Ton van den Heuvel · Accepted Answer · 2021-04-19 16:03:41Z

The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and gives slightly unexpected results in some corner cases.

I have the following use case, where I need a split function that splits input strings such that either single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output:

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]
 'abc"def ghi"'      | ['abc"def ghi"']
 'abc"def ghi""jkl"' | ['abc"def ghi""jkl"']
 'a"b c"d"e"f"g h"'  | ['a"b c"d"e"f"g h"']
 'c="ls /" type key' | ['c="ls /"', 'type', 'key']
 "abc'def ghi'"      | ["abc'def ghi'"]
 "c='ls /' type key" | ["c='ls /'", 'type', 'key']

I ended up with the following function, which produces the expected output for all of the input strings above:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

It ain't pretty; but it works. The following test application checks the results of other approaches (shlex and csv for now) and the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])
    test_case_fn(fn, 'abc"def ghi"', ['abc"def ghi"'])
    test_case_fn(fn, 'abc"def ghi""jkl"', ['abc"def ghi""jkl"'])
    test_case_fn(fn, 'a"b c"d"e"f"g h"', ['a"b c"d"e"f"g h"'])
    test_case_fn(fn, 'c="ls /" type key', ['c="ls /"', 'type', 'key'])
    test_case_fn(fn, "abc'def ghi'", ["abc'def ghi'"])
    test_case_fn(fn, "c='ls /' type key", ["c='ls /'", 'type', 'key'])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']
[FAIL] abc"def ghi" -> ['abcdef ghi']
[FAIL] abc"def ghi""jkl" -> ['abcdef ghijkl']
[FAIL] a"b c"d"e"f"g h" -> ['ab cdefg h']
[FAIL] c="ls /" type key -> ['c=ls /', 'type', 'key']
[FAIL] abc'def ghi' -> ['abcdef ghi']
[FAIL] c='ls /' type key -> ['c=ls /', 'type', 'key']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[FAIL] abc"def ghi" -> ['abc"def', 'ghi"']
[FAIL] abc"def ghi""jkl" -> ['abc"def', 'ghi""jkl"']
[FAIL] a"b c"d"e"f"g h" -> ['a"b', 'c"d"e"f"g', 'h"']
[FAIL] c="ls /" type key -> ['c="ls', '/"', 'type', 'key']
[FAIL] abc'def ghi' -> ["abc'def", "ghi'"]
[FAIL] c='ls /' type key -> ["c='ls", "/'", 'type', 'key']

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[ OK ] abc"def ghi" -> ['abc"def ghi"']
[ OK ] abc"def ghi""jkl" -> ['abc"def ghi""jkl"']
[ OK ] a"b c"d"e"f"g h" -> ['a"b c"d"e"f"g h"']
[ OK ] c="ls /" type key -> ['c="ls /"', 'type', 'key']
[ OK ] abc'def ghi' -> ["abc'def ghi'"]
[ OK ] c='ls /' type key -> ["c='ls /'", 'type', 'key']

shlex: 0.335ms per iteration
csv: 0.036ms per iteration
re: 0.068ms per iteration

So performance is much better than shlex, and can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.
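A sketch of the precompiled variant mentioned above, written so it also runs on Python 3. The pattern is the same as in quoted_split, only split across three raw-string pieces for readability; the names _TOKEN_RE and _strip_quotes are introduced here. No timing is claimed for this version.

```python
import re

# Same three alternatives as in quoted_split above, compiled once:
# tokens containing double-quoted runs, tokens containing single-quoted
# runs, and plain runs of non-whitespace.
_TOKEN_RE = re.compile(
    r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+'
    r"|(?:[^'\s]*'(?:\\.|[^'])*'[^'\s]*)+"
    r"|[^\s]+"
)


def _strip_quotes(token):
    # Only strip when the token both starts and ends with the same quote.
    if token and token[0] in "\"'" and token[0] == token[-1]:
        return token[1:-1]
    return token


def quoted_split(s):
    return [
        _strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
        for p in _TOKEN_RE.findall(s)
    ]


print(quoted_split('c="ls /" type key'))  # ['c="ls /"', 'type', 'key']
```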

Not sure what you're talking about:

>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
>>> shlex.split('this is \\"a test\\"')
['this', 'is', '"a', 'test"']
>>> shlex.split('this is "a \\"test\\""')
['this', 'is', 'a "test"']

@morsik, what is your point? Maybe your use case does not match mine? When you look at the test cases you'll see all cases where shlex does not behave as expected for my use cases.

I was hopeful, but unfortunately your approach also fails in a case I need, where shlex and csv fail as well. String to parse: command="echo hi" type key.

@Jean-BernardJansen, there were indeed some issues when it comes to handling quotes; I've updated the regex and it should now handle your case correctly.

elifiner · Accepted Answer · 2008-09-17 06:08:38Z

8

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
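The three steps (mask, split, restore) can also be sketched as below. This is a variation, not the code above: it uses "[^"]*" instead of ".+?" and strips the surrounding quotes so the result matches what the question asks for. PLACEHOLDER and split_with_placeholder are names chosen for this example, and the sketch assumes \x00 never occurs in the input.

```python
import re

PLACEHOLDER = "\x00"  # assumption: this character never appears in the input


def split_with_placeholder(s):
    # Step 1: hide the spaces inside double-quoted runs.
    masked = re.sub(
        r'"[^"]*"',
        lambda m: m.group(0).replace(" ", PLACEHOLDER),
        s,
    )
    # Step 2: split on whitespace.
    # Step 3: restore the hidden spaces and drop the surrounding quotes.
    return [p.replace(PLACEHOLDER, " ").strip('"') for p in masked.split()]


print(split_with_placeholder('this is "a test" some text "another test"'))
# ['this', 'is', 'a test', 'some', 'text', 'another test']
```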

answered Sep 17, 2008 at 6:08

elifiner

7,6659 gold badges43 silver badges48 bronze badges

2 Comments

Devin Jeanpierre Over a year ago

You should have used re.Scanner instead. It's more reliable (and I have in fact implemented a shlex-like using re.Scanner).

leetNightshade Over a year ago

+1 Hm, this is a pretty smart idea, breaking the problem down into multiple steps so the answer isn't terribly complex. Shlex didn't do exactly what I needed, even with trying to tweak it. And the single pass regex solutions were getting really weird and complicated.

har777 · Accepted Answer · 2018-04-12 08:36:18Z

7

Speed test of different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
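%timeit is an IPython magic, so the lines above do not run in a plain Python script. A rough equivalent using the standard timeit module might look like this sketch. The names candidates and run_benchmark are invented here, and the numbers it prints will differ from the ones above depending on machine and Python version.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# Each entry is a zero-argument callable so timeit can call it directly.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex": lambda: shlex.split(line),
}


def run_benchmark(number=10000):
    """Return the average time per call in microseconds for each candidate."""
    results = {}
    for name, fn in candidates.items():
        seconds = timeit.timeit(fn, number=number)
        results[name] = seconds / number * 1e6
    return results


if __name__ == "__main__":
    for name, usec in run_benchmark().items():
        print(f"{name:10s} {usec:8.2f} µs per call")
```

Keep in mind that the four candidates do not return identical results (some keep the quotes, csv returns a list of rows), so this compares speed only.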

edited Apr 12, 2018 at 8:36

answered Apr 12, 2018 at 8:28

har777

5035 silver badges12 bronze badges


THE_MAD_KING · Accepted Answer · 2017-03-26 23:08:09Z

4

To preserve quotes use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
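A variation on the same state machine, as a sketch: instead of building each token one character at a time, it remembers where the current token starts and slices it out. The name get_args is made up, and whether this is measurably faster has not been tested here. It deliberately differs from getArgs in one way: repeated spaces and empty input do not produce empty-string tokens.

```python
def get_args(s):
    """Split on spaces outside double quotes, keeping the quotes."""
    s = s.strip()
    args = []
    start = 0          # index where the current token begins
    in_quotes = False
    for i, ch in enumerate(s):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == " " and not in_quotes:
            if i > start:  # skip empty tokens caused by repeated spaces
                args.append(s[start:i])
            start = i + 1
    if start < len(s):
        args.append(s[start:])
    return args


print(get_args('this is "a test"'))  # ['this', 'is', '"a test"']
```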

answered Mar 26, 2017 at 23:08

THE_MAD_KING

411 bronze badge

1 Comment

FaranAiki Over a year ago

With bigger strings, your function is very slow.

user261478 · Accepted Answer · 2010-01-29 01:36:23Z

3

Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version (Python 2 only, because str.decode('string_escape') does not exist in Python 3):

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
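For Python 3, a sketch along the same lines is shown below. It is not a drop-in replacement: instead of decode('string_escape') it only unescapes \" and \', and it removes the outer quotes by slicing rather than with strip(), so an escaped quote at the very end of a substring is not eaten. The names split_escaped, _SPLIT_RE and _ESCAPED_QUOTE_RE are invented for this example.

```python
import re

# Same lookbehind pattern as above, written as a raw triple-quoted literal.
_SPLIT_RE = re.compile(r"""(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)'.*?(?<!\\)')""")
# Matches a backslash followed by a quote character.
_ESCAPED_QUOTE_RE = re.compile(r"""\\(["'])""")


def split_escaped(text, unescape=True):
    tokens = []
    for piece in _SPLIT_RE.split(text):
        if not piece.strip():
            continue  # separator whitespace or an empty split artifact
        # Remove exactly one pair of matching outer quotes, if present.
        if len(piece) >= 2 and piece[0] in "\"'" and piece[0] == piece[-1]:
            piece = piece[1:-1]
        if unescape:
            piece = _ESCAPED_QUOTE_RE.sub(r"\1", piece)
        tokens.append(piece)
    return tokens


print(split_escaped(r'This is "a \"test\" string"'))
# ['This', 'is', 'a "test" string']
```

The limitation raised in the comments still applies: a quote in the middle of a word starts a new token, so "Test:'test'" is split into two pieces.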

answered Jan 29, 2010 at 1:36

user261478

3 Comments

drootang Over a year ago

This is by far the best answer. Using a negative lookbehind is the best way to ensure you don't match escaped end-quote characters and don't start a new quote with an escaped start-quote character. It's easily extensible to support multiple quoting characters (e.g., ", ', {}, [], etc.)

drootang Over a year ago

In my case I needed to preserve the quote character on each string, so I just removed the .strip() calls in the list comprehension.

Calab Over a year ago

This almost works, except it is splitting at a punctuation quote... ie: "Test:'test'" should return ["Test:'test'"] but instead returns ["Test:", "test"].

Mikhail Zakharov · Accepted Answer · 2020-03-30 11:49:43Z

3

As an option try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

answered Mar 30, 2020 at 11:49

Mikhail Zakharov

1,1791 gold badge12 silver badges23 bronze badges


moschlar · Accepted Answer · 2012-06-25 17:51:17Z

1

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

answered Jun 25, 2012 at 17:51

moschlar

1,33611 silver badges18 bronze badges

1 Comment

Peter Varo Over a year ago

For python 2.7.5 this should be: split = lambda a: [b.decode('utf-8') for b in _split(a)] otherwise you get: UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...: ordinal not in range(128)

hussic · Accepted Answer · 2015-09-09 13:19:59Z

0

I suggest:

test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

to capture also "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

to ignore empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

answered Sep 9, 2015 at 13:19

hussic

1,97010 silver badges10 bronze badges

1 Comment

hochl Over a year ago

Could be written as re.findall("(?:\".*?\"|'.*?'|[^\s'\"]+)", s) also.

Gregory · Accepted Answer · 2008-09-17 05:46:57Z

-3

If you don't care about substrings, then a simple

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or string module

>>> from string import split as stringsplit
>>> stringsplit('a short sized string with spaces '*100)

Performance: String module seems to perform better than string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use RE engine

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings you should not load the entire string into memory and instead either split the lines or use an iterative loop

answered Sep 17, 2008 at 5:46

Gregory

1,4871 gold badge15 silver badges22 bronze badges

1 Comment

rjmunro Over a year ago

You seem to have missed the whole point of the question. There are quoted sections in the string that need to not be split.

pjz · Accepted Answer · 2016-09-23 00:04:24Z

-3

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

edited Sep 23, 2016 at 0:04

answered Sep 17, 2008 at 4:31

pjz

43.4k6 gold badges54 silver badges60 bronze badges

4 Comments

pjz Over a year ago

Please supply the repr of a string you think will fail.

Matthew Schinckel Over a year ago

Think? adamsplit("This is 'a test'") → ['This', 'is', "'a", "test'"]

pjz Over a year ago

OP only says "within quotes" and only has an example with double-quotes.

Quantalabs Over a year ago

Is there a way however to preserve the quotes themselves? For example, ['This', 'is', "'a test'"]
