0

I have implemented an app on a device that sends data to and receives data from a server. Data from the server usually arrives in this form:

"1;username;someInteger;"

Parsing was easy: as you can imagine, I was using strtok to retrieve the individual values from that string, such as 1, username, and someInteger.

But now the server may send me a Unicode string as the username.

I think a good idea is to have the username encoded as a UTF-8 string (am I right?). What do you recommend: how should I parse it out of the string above? Which symbol should I use as a separator (e.g., instead of ";"), and which functions should I use to extract the username?

As this is an embedded device, I want to avoid installing third-party libraries there (which might not even be possible), so more "pure" ways would be preferable.

5
  • Avoid strtok. It’s not thread-safe. Use boost::split instead. Commented Oct 30, 2013 at 8:52
  • 1
    @rightfold Avoid boost for a simple substitution of strtok(). It's too big. Use strtok_r() instead. Commented Oct 30, 2013 at 8:52
  • 1
    @H2CO3: Yes, as I mentioned, this is an embedded device, so I am trying to avoid installing large third-party libraries there (I'm not even sure that's possible). Commented Oct 30, 2013 at 8:53
  • @H2CO3 Did you look at the strtok source code, or the binary code generated for it? Is it "small" compared to "boost::split"? Commented Oct 30, 2013 at 9:54
  • @Abyx Where are all my comments? As to your question: here is an implementation of strtok(), and here is iter_split() that boost::algorithm::string::split() uses. Altogether, strtok() is fewer SLOC than the boost thingy, and it has the advantages that it 1. is standard, 2. doesn't require inclusion of huge headers, 3. works in C too. Commented Oct 30, 2013 at 10:21
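
The thread-safety concern in these comments comes from strtok() keeping its position in hidden static state. strtok_r() is POSIX rather than ISO C, so it may be missing from a small embedded toolchain. As a minimal sketch of a re-entrant alternative that needs only strchr(), assuming a hypothetical helper named next_field (not part of any library):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the next ';'-terminated field starting at
   *cursor, or NULL when the input is exhausted. The caller owns the cursor,
   so there is no hidden static state as in strtok(). The buffer is modified
   in place: each ';' that ends a field is overwritten with '\0'. */
char *next_field(char **cursor)
{
    char *start = *cursor;
    if (start == NULL || *start == '\0')
        return NULL;

    char *sep = strchr(start, ';');
    if (sep != NULL) {
        *sep = '\0';        /* terminate this field in place */
        *cursor = sep + 1;  /* continue right after the separator */
    } else {
        *cursor = NULL;     /* last field had no trailing ';' */
    }
    return start;
}
```

A caller would set `char *cur = buf;` and call `next_field(&cur)` until it returns NULL. Unlike strtok(), this sketch returns an empty string for an empty field (two adjacent separators) instead of silently skipping it.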

2 Answers

4

The character ';' is encoded the same way in UTF-8 as in ASCII, because the first 128 code points (0 to 127) are identical in both encodings. That means you can still use strtok to split on the ';'.
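
A minimal sketch of what that looks like in practice, assuming the three-field message from the question. The struct message type, the parse_message function, and the 64-byte username limit are hypothetical names and sizes chosen for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for the "1;username;someInteger;" message. */
struct message {
    int  type;
    char username[64];  /* UTF-8 bytes, NUL-terminated */
    long value;
};

/* Splits buf in place on ';'. Returns 0 on success, -1 if a field is
   missing or the username does not fit. buf is modified because strtok
   overwrites each separator with '\0'. */
int parse_message(char *buf, struct message *out)
{
    char *type  = strtok(buf, ";");
    char *name  = strtok(NULL, ";");
    char *value = strtok(NULL, ";");

    if (type == NULL || name == NULL || value == NULL)
        return -1;
    if (strlen(name) >= sizeof out->username)
        return -1;  /* length is counted in bytes, not characters */

    out->type = atoi(type);
    strcpy(out->username, name);  /* copies the UTF-8 bytes unchanged */
    out->value = strtol(value, NULL, 10);
    return 0;
}
```

The username comes out as the same UTF-8 byte sequence that was sent; nothing here decodes it. Note that strtok treats adjacent separators as one, so a message with an empty username such as "1;;42;" yields only two tokens and is rejected by this sketch.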


5 Comments

I have heard that strtok may "stop" if it encounters a null terminating character along the way. Can't the Unicode string contain some character between two separators (;) that strtok will interpret as a null terminator?
@dmcr_code No. Multibyte sequences contain only bytes with values >= 128 (or < 0, depending on char signedness), so any ASCII byte is an ASCII code point. There's no code point except the null character whose encoding contains a null byte (see stackoverflow.com/questions/6907297/can-utf-8-contain-zero-byte). In other words: if strtok encounters a null byte, it is the zero terminator and nothing else. The same applies to any other ASCII value,
meaning: the ; can't be found as part of a multibyte sequence either, so you won't get false positives.
@Arne Mertz: OK, I got it. I was just wondering whether there might be some non-ASCII character in the string whose encoded value was, for instance, 45 00. Then strtok would interpret the last byte of that symbol as a null terminator, right? (But I think you said this can't be the case.)
@dmcr_code Yes, that can't be. Neither 45 nor 00 can be part of a multibyte sequence (non-ASCII code point); such sequences contain only byte values >= 80 (hex).
0

The whole point of UTF-8 is that you hardly have to do anything. ASCII characters still encode as the same single bytes they always did, so if you keep using semicolon separators, your existing parsing code keeps working.

2 Comments

I think strtok may stop if it encounters a null terminating character along the way. Can't the Unicode string contain some character between two separators (;) that strtok will interpret as a null terminator?
Nope: the only "null terminating character" is ASCII 0, aka NUL. No UTF-8 encoding of any code point contains a NUL byte except the encoding of NUL itself, which, again, is just what it would be in plain ASCII.
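
The claim can be checked mechanically. The following is a sketch, not taken from either answer: a small UTF-8 encoder (utf8_encode and multibyte_is_ascii_free are hypothetical names) that encodes every non-ASCII code point and confirms that each resulting byte is >= 0x80, so neither NUL nor ';' can ever appear inside a multibyte sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                     /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if no non-ASCII code point produces a byte below 0x80. */
int multibyte_is_ascii_free(void)
{
    unsigned char bytes[4];
    for (uint32_t cp = 0x80; cp <= 0x10FFFF; cp++) {
        size_t n = utf8_encode(cp, bytes);
        for (size_t i = 0; i < n; i++)
            if (bytes[i] < 0x80)
                return 0;
    }
    return 1;
}
```

Every lead byte is OR-ed with at least 0xC0 and every continuation byte with 0x80, which is why the check cannot fail: the high bit is set by construction.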
