How to implement a Unicode buffer in python

Question

After an initial search on this, I'm bit lost.

I want to use a buffer object to hold a sequence of Unicode code points. I just need to scan and extract tokens from said sequence, so basically this is a read only buffer, and we need functionality to advance a pointer within the buffer, and to extract sub-segments. The buffer object should of course support the usual regex and search ops on strings.

An ordinary Unicode string can be used for this, but the issue would be the creating of sub-string copies to simulate advancing a pointer within the buffer. This seems to be very inefficient esp for larger buffers, unless there's some workaround.

I can see that there's a Memoryview object that would be suitable, but it does not support Unicode (?).

What else can I use to provide the above functionality? (Whether in Py2 or Py3).

Armin Rigo · Accepted Answer · 2013-07-28 10:15:55Z

1

It depends on what exactly is needed, but usually just one Unicode string is enough. If you need to take non-tiny slices, you can keep them as 3-tuples (big unicode, start pos, end pos) or just make custom objects with these 3 attributes and whatever API is needed. The point is that a lot of methods like unicode.find() or the regex pattern objects's search() support specifying start and end points. So you can do most basic things without actually needing to slice the single big unicode string.

answered Jul 28, 2013 at 10:15

Armin Rigo

13.1k41 silver badges50 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Basel Shishani Over a year ago

Thanks. This should be okay for my case, I hope a proper Unicode buffer is on someone's drawing board.

Collectives™ on Stack Overflow

How to implement a Unicode buffer in python

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related