After an initial search on this, I'm a bit lost.

I want to use a buffer object to hold a sequence of Unicode code points. I only need to scan the sequence and extract tokens from it, so this is essentially a read-only buffer. I need to be able to advance a pointer within the buffer and to extract sub-segments. The buffer object should, of course, support the usual regex and search operations on strings.

An ordinary Unicode string can be used for this, but the problem is that simulating a pointer advancing through the buffer means creating sub-string copies. That seems very inefficient, especially for larger buffers, unless there's some workaround.

I can see that there's a `memoryview` object that looks suitable, but it does not seem to support Unicode (?).

What else can I use to provide the above functionality (in either Py2 or Py3)?

1 Answer

It depends on exactly what you need, but a single Unicode string is usually enough. If you need to take non-tiny slices, you can keep them as 3-tuples (big unicode string, start pos, end pos), or make custom objects with these three attributes and whatever API you need. The point is that many methods, such as `unicode.find()` (`str.find()` in Py3) or a regex pattern object's `search()`, accept start and end positions. So you can do most basic things without ever slicing the single big Unicode string.
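As a rough illustration of this approach, here is a minimal sketch of a scanner that advances an integer position instead of slicing. The token grammar and the `tokenize` helper are made up for the example; only `Pattern.match(string, pos)` and `str.find(sub, start)` are the actual library calls being demonstrated.

```python
import re

# Hypothetical token grammar, for illustration only:
# optional whitespace, then a number, a name, or a single operator character.
TOKEN = re.compile(
    r"\s*(?:(?P<num>\d+)|(?P<name>\w+)|(?P<op>[^\w\s]))",
    re.UNICODE,  # needed on Py2 for Unicode-aware \w; redundant on Py3
)


def tokenize(text):
    """Yield (kind, start, end) triples without ever slicing `text`."""
    pos = 0
    end = len(text)
    while pos < end:
        # match() is anchored at `pos`; no sub-string copy is created.
        m = TOKEN.match(text, pos)
        if m is None:
            break  # only trailing whitespace is left
        kind = m.lastgroup
        yield kind, m.start(kind), m.end(kind)
        pos = m.end()  # "advance the pointer"


text = u"caf\u00e9 = 42 + \u03c0"
tokens = list(tokenize(text))

# Plain searches accept a start offset as well.
plus_at = text.find(u"+", 7)
```

Each token stays a (kind, start, end) triple, so a slice such as `text[start:end]` is only made when the token text is actually needed. One caveat: with a `pos` argument, `^` in a pattern still refers to the real beginning of the string, not to `pos`, so use `match()` rather than `^` with `search()` to anchor at the current position.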

1 Comment

Thanks. This should be okay for my case. I hope a proper Unicode buffer is on someone's drawing board.
