parsing utf8 string from server response

Question

I implemented an app on a device that sends data to and receives data from a server. Data from the server usually arrives in this form:

"1;username;someInteger;"

Parsing was easy: I used strtok to retrieve the individual values from that string (1, username, and someInteger).

But now the server may send me a Unicode string as the username.

I think a good idea is to have the username encoded as a UTF-8 string (am I right?). What do you recommend: how should I parse it from the above string? Which symbol should I use as a separator (e.g., instead of ";"), and which functions should I use to extract the username from the above string?

As this is an embedded device, I want to avoid installing third-party libraries there (which might not even be possible), so more "pure" approaches would be preferable.

Avoid strtok. It's not thread-safe. Use boost::split instead.
– user1804599, Commented Oct 30, 2013 at 8:52
@rightfold Avoid boost for a simple substitution of strtok(). It's too big. Use strtok_r() instead.
– user529758, Commented Oct 30, 2013 at 8:52
@H2CO3: Yes, as I mentioned, this is an embedded device. I am trying to avoid installing large third-party libraries there (I'm not even sure that's possible).
– user2793162, Commented Oct 30, 2013 at 8:53
@H2CO3 Did you look at the strtok source code, or the binary code generated for it? Is it "small" compared to "boost::split"?
– Abyx, Commented Oct 30, 2013 at 9:54
@Abyx Where are all my comments? As to your question: here is an implementation of strtok(), and here is iter_split() that boost::algorithm::string::split() uses. Altogether, strtok() is fewer SLOC than the boost thingy, and it has the added advantages that it 1. is standard, 2. doesn't require including huge headers, 3. works in C too.
– user529758, Commented Oct 30, 2013 at 10:21
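
For readers following the strtok_r() suggestion in the comments above, here is a minimal sketch of a reentrant split on ';'. The helper name split_fields and its parameters are made up for illustration. strtok_r() is a POSIX function, not ISO C, so whether an embedded toolchain provides it is an assumption to check.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes strtok_r() on POSIX systems */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `line` in place on ';' and stores up to
 * `max_fields` token pointers in `fields`. Returns the number stored.
 * `line` must be writable: strtok_r() overwrites separators with '\0'. */
size_t split_fields(char *line, char *fields[], size_t max_fields)
{
    assert(line != NULL && fields != NULL);

    size_t count = 0;
    char *save = NULL;  /* per-call state instead of a hidden global */

    for (char *tok = strtok_r(line, ";", &save);
         tok != NULL && count < max_fields;
         tok = strtok_r(NULL, ";", &save)) {
        fields[count++] = tok;
    }
    return count;
}
```

Because the parser state lives in the caller-visible `save` variable, two threads (or two nested loops) can tokenize different buffers without interfering with each other.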

Some programmer dude · Accepted Answer · 2013-10-30 08:52:51Z

4

The character ';' is encoded the same way in UTF-8 as in ASCII, because the first 128 characters (code points 0 to 127) are identical in both encodings. That means you can still use strtok to split on the ';'.
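
As a rough sketch of what this means in practice, the following parses a message whose username contains a non-ASCII character ("Jürgen", where "ü" is the two UTF-8 bytes C3 BC). The struct message type and the parse_message function are hypothetical names invented for this example; only the "id;username;someInteger;" layout comes from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for the "id;username;someInteger;" message. */
struct message {
    long id;
    const char *username;  /* points into the caller's buffer (UTF-8 bytes) */
    long value;
};

/* Parses `buf` in place. Returns 0 on success, -1 if a field is missing. */
int parse_message(char *buf, struct message *out)
{
    assert(buf != NULL && out != NULL);

    char *id = strtok(buf, ";");
    char *name = strtok(NULL, ";");
    char *value = strtok(NULL, ";");
    if (id == NULL || name == NULL || value == NULL)
        return -1;

    out->id = strtol(id, NULL, 10);
    out->username = name;   /* the UTF-8 bytes are passed through untouched */
    out->value = strtol(value, NULL, 10);
    return 0;
}
```

Calling it on a writable buffer such as `char buf[] = "1;J\xC3\xBC" "rgen;42;";` leaves `username` pointing at the seven bytes of "Jürgen". One caveat that is independent of UTF-8: strtok skips empty fields, so a message with an empty username would shift the remaining fields.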


5 Comments

user2793162 Over a year ago

I have heard strtok may "stop" if in between it encounters "null terminating character" - can't it be the case that the Unicode string contains some characters in between two separators (;) that strtok will interpret as null terminating character?

Arne Mertz Over a year ago

@dmcr_code No. Multibyte sequences contain only bytes with values >= 128 (or < 0, depending on char signedness), so any ASCII byte is an ASCII code point. No code point except the null character contains a null byte (see stackoverflow.com/questions/6907297/can-utf-8-contain-zero-byte). In other words: if strtok encounters a null byte, it is the zero terminator and nothing else. The same applies to any other ASCII value:

Arne Mertz Over a year ago

meaning: the ; can't be found as part of a multibyte sequence either, so you won't get false positives.

user2793162 Over a year ago

@Arne Mertz: OK, I got it. I was just wondering whether there might be some non-ASCII character in the string whose encoded value was, for instance, 45 00; then strtok would interpret the last byte of that symbol as the null terminator, right? (But I think you said this can't be the case.)

Arne Mertz Over a year ago

@dmcr_code Right, that can't happen. Neither 45 nor 00 can be part of a multibyte sequence (non-ASCII code point); such sequences contain only byte values >= 80 (hex).
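
The point made in these comments can be checked with a single bit test: every byte of a multibyte UTF-8 sequence has its high bit set, so bytes such as 45, 00, or ';' (3B) never appear inside one. A pair like 45 00 is what UTF-16 would produce, not UTF-8. The function names below are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if `b` is a lead or continuation byte of a multibyte
 * UTF-8 sequence (high bit set), 0 if it is a plain ASCII byte. */
int is_multibyte_part(unsigned char b)
{
    return (b & 0x80) != 0;
}

/* Counts the bytes of `s` that belong to multibyte sequences.
 * The cast to unsigned char avoids surprises where plain char is signed. */
size_t count_multibyte_bytes(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (is_multibyte_part((unsigned char)*s))
            ++n;
    }
    return n;
}
```

For "Jürgen" encoded as UTF-8, only the two bytes C3 and BC are flagged; the separator and the terminator are never flagged, which is why strtok cannot split or stop in the middle of a character.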

Dolda2000 · Accepted Answer · 2013-10-30 08:51:57Z

0

The nice thing about UTF-8 is that you hardly have to do anything at all. ASCII characters still encode as the same ASCII bytes they always did, so if you keep using semicolon separators, your existing parsing keeps working.


2 Comments

user2793162 Over a year ago

I think strtok may stop if in between it encounters "null terminating character" - can't it be the case that the Unicode string contains some characters in between two separators (;) that strtok will interpret as null terminating character?

Dolda2000 Over a year ago

Nope: the only "null terminating character" is ASCII 0, aka NUL. No UTF-8 encoded character contains a NUL byte except NUL itself, which, again, is encoded just as it would be in plain ASCII.
