std::regex fatal error

Question

I'd like to think this isn't actually a bug in the standard library, but I'm running out of places to look.

The statement std::regex(expression) where expression is a std::string causes a memory access fatal error.

expression is declared by the statement:

std::string expression = std::string("^(") +
    std::string("[\x09\x0A\x0D\x20-\x7E]|") + // ASCII
    std::string("[\xC2-\xDF][\x80-\xBF]|") + // non-overlong 2-byte
    std::string("\xE0[\xA0-\xBF][\x80-\xBF]|") + // excluding overlong
    std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte
    std::string("\xED[\x80-\x9F][\x80-\xBF]|") + // excluding surrogates
    std::string("\xF0[\x90-\xBF][\x80-\xBF]{2}|") + // planes 1-3
    std::string("[\xF1-\xF3][\x80-\xBF]{3}|") + // planes 4-15
    std::string("\xF4[\x80-\x8F][\x80-\xBF]{2}") + // plane 16
    ")*$";

This regex was taken from http://www.w3.org/International/questions/qa-forms-utf-8 to test whether a byte sequence is valid UTF-8.

Is this actually a bug in the library, or am I missing something really tiny?

Compiled with VS2015 C++, if that happens to make a difference.

EDIT: I forgot to mention that one specific line breaks the code: std::string("[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|") + // straight 3-byte is the only line that breaks. Comment that out and it works fine. This line on its own causes the memory access error.

You need to double-escape backslashes in a non-raw string literal. Best is to use raw string literals. Try replacing all \x with \\x.
– Wiktor Stribiżew, Commented Feb 8, 2016 at 19:14
C++ requires escapes to be escaped in string literals, unless using the new raw syntax.
– user557597, Commented Feb 8, 2016 at 19:25
The example you cite uses \A ... \z for a reason. You shouldn't use ^ ... $, as it's not the same.
– user557597, Commented Feb 8, 2016 at 19:48

score 1 · Accepted Answer · 2016-02-08 19:37:41Z

If you use escapes in a string literal without the raw syntax,
you have to escape the escapes so that the backslash itself reaches the regex engine.

For example, the new string:

std::string expression = std::string("^(") +
    std::string("[\\x09\\x0A\\x0D\\x20-\\x7E]|") + // ASCII
    std::string("[\\xC2-\\xDF][\\x80-\\xBF]|") + // non-overlong 2-byte
    std::string("\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|") + // excluding overlong
    std::string("[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|") + // straight 3-byte
    std::string("\\xED[\\x80-\\x9F][\\x80-\\xBF]|") + // excluding surrogates
    std::string("\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|") + // planes 1-3
    std::string("[\\xF1-\\xF3][\\x80-\\xBF]{3}|") + // planes 4-15
    std::string("\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}") + // plane 16
    ")*$";

When you don't escape them, the compiler interprets each \xNN itself and
replaces it with the single byte that has that hex value, so the regex engine receives raw bytes rather than the text \xNN.
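
To make the difference concrete, here is a minimal sketch that compares what each spelling of the literal actually stores. The function names are made up for illustration, and it relies only on how C++ string literals work, not on any regex behavior.

```cpp
#include <cstddef>
#include <string>

// Unescaped: the compiler turns \x80 into one byte with value 0x80.
std::size_t unescaped_length() {
    return std::string("\x80").size();
}

// Escaped: the string holds four characters: backslash, 'x', '8', '0'.
std::size_t escaped_length() {
    return std::string("\\x80").size();
}

// Checks that the escaped literal really contains the text \x80,
// which is what the regex engine needs to see to parse the escape itself.
bool escaped_is_text() {
    const std::string s = "\\x80";
    return s.size() == 4 && s[0] == '\\' && s[1] == 'x' && s[2] == '8' && s[3] == '0';
}
```

With the single backslash the regex engine is handed one raw byte; with the doubled backslash it is handed the four-character escape sequence and does the hex conversion itself.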

And while the regex engine probably gets the right character either way,
it is better to pass the hex escape through to the engine, so you can see which character
breaks it (if one does).
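
A raw string literal, as suggested in the comments, is another way to get the backslashes through without doubling them. The sketch below is only an illustration, not the original code: `is_printable_ascii` and `raw_equals_escaped` are hypothetical names, and it deliberately uses just the ASCII branch of the pattern, since how bytes above 0x7F are handled is the open question here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the ASCII branch of the pattern only, written as a
// raw string literal so that \x09 etc. reach the regex engine as text.
bool is_printable_ascii(const std::string& text) {
    static const std::regex ascii_only(R"(^[\x09\x0A\x0D\x20-\x7E]*$)");
    return std::regex_match(text, ascii_only);
}

// The raw literal and the double-escaped literal are the same string.
bool raw_equals_escaped() {
    return std::string(R"([\x80-\xBF])") == std::string("[\\x80-\\xBF]");
}
```

Under these assumptions, is_printable_ascii("hello\tworld") accepts the input, while a string containing the control byte 0x01 is rejected. The remaining branches can be written the same way, one R"(...)" piece per line.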

edited Feb 8, 2016 at 19:37

answered Feb 8, 2016 at 19:30

user557597

7 Comments

James Over a year ago

Does that explain why only one line crashes the program, though? I can comment out the line that breaks it, leaving the single backslashes in all the rest, and the code will run fine. While the double-backslash information is good to know, it doesn't fix the fact that the program will crash.

user557597 Over a year ago

Hang on a second. I use VS2010; I'm going to run your ASCII string and see if it crashes. Could be a trigraph issue.

user557597 Over a year ago

I get this in the debugger

^([	   -~]|[Â-ß][€-¿]|à[ -¿][€-¿]|[á-ìîï][€-¿]{2}|í[€-Ÿ][€-¿]|ð[-¿][€-¿]{2}|[ñ-ó][€-¿]{3}|ô[€-][€-¿]{2})*$

and it doesn't crash my app.

James Over a year ago

This wasn't a problem under VS2010. It popped up in a VS2015 conversion. Thank you though.

user557597 Over a year ago

Yeah, for example \x80 turns into the glyph U+20AC in extended ASCII, using (my) default locale. And "[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|" turns into [\x{E1}-\x{EC}\x{EE}\x{EF}][\x{20AC}-\x{BF}]{2}| on my machine. That's probably why it's never a good idea to use literals beyond ASCII in regular expressions. Using "\\x00" would be better.
