Efficiently slicing a string in Python3

Question

Since Python slices by copying, slicing strings can be very costly.

I have a recursive algorithm that operates on strings. Specifically, if a function is passed a string a, the function calls itself on a[1:]. The problem is that the strings are so long that slice-by-copy is becoming a very costly way to remove the first character.

Is there a way to get around this, or do I need to rewrite the algorithm entirely?

You're totally right, I just misunderstood how memoryview worked and thought it was the same. Closing the question.
– stoksc, Commented Mar 6, 2018 at 2:22
Also, a[:1] is slicing out the first character (if any), which is incredibly cheap. Did you mean a[1:] (which would slice all but the first character)?
– ShadowRanger, Commented Mar 6, 2018 at 2:25
@sudo: List slices create new lists; sure, the len 1 str objects wouldn't be copied, but the pointers to them would be, and the pointers are 4-8 bytes apiece, vs. 1-4 bytes apiece for each character in a string. The big-O cost of a list slice is identical to that of a str slice. For Python built-in types, about the only types with O(1) slicing are memoryview and (on Py3) range. numpy adds a whole slew of view-like sequences, but it's not a built-in package.
– ShadowRanger, Commented Mar 6, 2018 at 2:50
@sudo: str slices copy instead of creating views partially to keep the implementation simpler (the raw data can be allocated with the object header in a single block, with no need to have the data allocated separately with separate reference counts, and avoiding the need to store an offset into the data), and partially to avoid keepalive effects. If someone does something like smallstr = mystr[1:11] where mystr is 1 GB long, it would be ridiculous to keep mystr alive forever just because smallstr was looking at 10 characters of it.
– ShadowRanger, Commented Mar 6, 2018 at 4:55

ShadowRanger · Accepted Answer · 2018-03-06 02:37:46Z

The only way to get around this in general is to make your algorithm use bytes-like types, either Py2 str or Py3 bytes; views of Py2 unicode/Py3 str are not supported. I provided details on how to do this in my answer to a related question, but the short version is, if you can assume bytes-like arguments (or convert to them), wrapping the argument in a memoryview and slicing is a reasonable solution. Once converted to a memoryview, slicing produces new memoryviews with O(1) cost (in both time and memory), rather than the O(n) time/memory cost of text slicing.
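As a rough illustration of the memoryview approach, here is a minimal sketch. The function `count_byte` is a hypothetical stand-in for the asker's recursive algorithm (the real one isn't shown), and it assumes the text can be encoded as ASCII so that one byte corresponds to one character.

```python
def count_byte(view, target):
    """Recursively count how often the byte value `target` occurs in `view`.

    Hypothetical example function; `view` is assumed to be a memoryview
    over bytes-like data.
    """
    if len(view) == 0:
        return 0
    # view[1:] is a new memoryview over the same buffer; no bytes are copied.
    return (view[0] == target) + count_byte(view[1:], target)


text = "hello world"
data = memoryview(text.encode("ascii"))  # one O(n) copy, made once up front
print(count_byte(data, ord("o")))  # 2
```

Indexing a memoryview of bytes yields ints, hence the `ord("o")`. Note that this only removes the slicing cost; very long inputs would still run into Python's recursion limit with a one-character-per-call recursion like this one.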

8 Comments

sudo Over a year ago

N.B. bytes won't work unless the string uses a fixed-width encoding like ASCII. For example, UTF-8 won't work, but you could easily convert to UTF-32 to get around this.

ShadowRanger Over a year ago

@sudo: Yeah, my other answer covers that issue. It's not quite as easy as just "convert to UTF-32", though; UTF-32 is fixed width, but you either have to give up bytes-like behavior (e.g. on Py3, casting the memoryview to a four-byte format) or keep it bytes-like but manually adjust for the larger character width each time (slicing off the first four bytes rather than the first character each time). It can be a pain.
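A sketch of the two UTF-32 options described in this comment, for Py3. The helper names `to_codepoint_view` and `view_to_str` are made up for illustration, and the code assumes the native "I" format is four bytes wide, which holds on mainstream platforms but is not guaranteed by the language.

```python
import sys

# Pick the UTF-32 variant matching this machine's byte order, so the native
# "I" items below hold the right code points (and no BOM is prepended).
CODEC = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"


def to_codepoint_view(text):
    """Return a memoryview with one unsigned 32-bit item per code point."""
    # Assumption: native "I" is 4 bytes; cast() raises otherwise when the
    # buffer length is not a multiple of the item size.
    return memoryview(text.encode(CODEC)).cast("I")


def view_to_str(view):
    """Decode a (possibly sliced) code point view back into a str."""
    return view.tobytes().decode(CODEC)


text = "naïve ☃ text"

# Option 1: cast to a four-byte format; slices are per character, items are ints.
view = to_codepoint_view(text)
first = chr(view[0])
rest = view[1:]  # O(1), shares the encoded buffer

# Option 2: stay bytes-like and drop four bytes per character by hand.
byte_view = memoryview(text.encode(CODEC))
rest_bytes = byte_view[4:]  # also O(1)
```

In both cases the items are no longer str objects, so any comparisons inside the algorithm have to be against code points (`ord(...)`) or encoded bytes, which is the "pain" mentioned above.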

stoksc Over a year ago

Thanks, solid answer. I feel like it's really easy to get caught up with how easy operations are in Python and forget about their efficiency entirely. Curious, how did you learn all these big-O guarantees for Python operations?

ShadowRanger Over a year ago

@lieblos: The language spec doesn't provide guarantees in most cases (the Python tutorial mentions that slicing a list shallow copies, but otherwise doesn't get into it). Stuff like str slices being copies vs. views isn't a language guarantee; at one point they considered making str slices views, but decided against it largely because of the implicit, non-intuitive keep-alive effect (slicing two characters out of a 1 GB str causing the 1 GB str to live until the slice is cleaned up). Mostly I just read the Python bug tracker, the PEPs, and the What's New pages for each release.

ShadowRanger Over a year ago

@sudo: To an extent, yes. The requirements imposed on the various built-in types mean that you only have a few practical options, all of which share the same big-O performance. Some of the big-Os are documented sort of by accident (e.g. list having O(n) memory movement costs for pop/insert on the left is documented as a detail of collections.deque, which exists to alleviate that problem, as is list's O(1) random access, which collections.deque lacks). In practice, list is always like C++ vector, set/dict are hash based, etc. New solutions aren't invented often.
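To make the list vs. collections.deque trade-off concrete, here is a small sketch. `drain_left` is a hypothetical helper, not something from the thread; it only shows that removing from the left of a deque avoids the O(n) shifting that list.pop(0) incurs.

```python
from collections import deque


def drain_left(items):
    """Consume `items` from the left, one element at a time."""
    queue = deque(items)  # one O(n) copy up front
    out = []
    while queue:
        # popleft() is O(1); list.pop(0) would shift every remaining element.
        out.append(queue.popleft())
    return out


print(drain_left("abc"))  # ['a', 'b', 'c']
```

The price is the one mentioned in the comment: indexing into the middle of a deque is not O(1), so this only pays off when the access pattern is mostly at the ends.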
