I have various data that I need to parse in order to extract a weight from it.

I'm using

  • C++11
  • std::regex
  • Debian 9.9
  • gcc 6.3.0

The problem is that a segmentation fault sometimes occurs, although very rarely.

The input that triggers the error consists mostly of space and newline characters.

Here is the regex:

(?:\b(?:(kilogram\.*s*\.*|kg\.*s*\.*)(?:[^[:alnum:]])*)(?:\s*weight\s*)*(?:\s*is\s*|\s*are\s*)*)\W*([\d\.,]*\d+\b)|(?:(?:[\s\.]?|^)([\d\.,]*\d+)\W*(kilogram\.*s*\.*|kg\.*s*\.*)\b)

Example input (regex101) for which the regex works on regex101.com but causes a segmentation fault in C++ on my Debian server.

Here are some more regex101 examples of input, to give a quick idea of what the regex is searching for.

Here is an example of C++ code that fails.

And here is the same C++ code that works, but using another online compiler (cpp.sh).

Can someone please help me to solve this segmentation fault problem?

Thank you.

  • (?:\s*weight\s*)* is a killing pattern causing too much backtracking. Commented Jul 5, 2019 at 14:57
  • Try regex101.com/r/kXrDeD/2 Commented Jul 5, 2019 at 15:02
  • @WiktorStribiżew Thank you, but the problem still occurs :( I'm trying to modify somehow but no success :( Commented Jul 5, 2019 at 15:26
  • Also, match13 is supposed to be .15 (0.15), so it needs to include the dot. Commented Jul 5, 2019 at 15:29
  • I shared a PCRE regex, \b(k(?:ilogram|g)\.*s*\.*)\W*(?:\s+weight)?(?:\s+(?:is|are))?\W*(\d[0-9.,]*\b)|[\s.]?(\d[0-9.,]*)\W*(k(?:ilogram|g)\.*s*\.*)\b is the ECMAScript compatible. See demo. Commented Jul 5, 2019 at 16:24

1 Answer

I have the same issue with simple regexes such as .+ and [a-zA-Z0-9\\+/=]+.

I have tried different compilers: g++, clang++, clang-cl on Windows, and g++, clang++ on Linux (WSL).

On Windows, the application freezes and then terminates abruptly. On Ubuntu (WSL), I get a segmentation fault.

With g++ on Windows, the error happens with C++11, C++14, C++17 and C++20.

Limit

In your example, the data (regex101) has 31275 characters, which I suppose is too many for regex_match.

Here is the program I used to guess the maximal length of the data.

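#include <iostream>
#include <regex>
#include <string>

int main(int argc, char **argv) {
    // Input length: first command-line argument, or 30000 by default.
    const int length = argc > 1 ? std::stoi(argv[1]) : 30000;
    const std::regex testRegex(".+");
    // Build a string made of `length` copies of 'a'.
    const std::string data(length, 'a');
    // Prints 1 on a match; past the limit the process crashes before printing.
    std::cout << "Match: " << std::regex_match(data, testRegex) << std::endl;
    return 0;
}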

// Limit before crash (it's a bit random so the limit is not accurate)

// Windows 11
// clang++ Windows : 4999998
// clang-cl Windows : 4999998
// g++ Windows : 6833

// WSL Ubuntu 20.04
// clang++ WSL : 23804
// g++ WSL : 26187

How to solve

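According to this test, regex_match can only handle input up to a certain length, and the application crashes once that length is exceeded. The limit differs between compilers and platforms, which suggests that it depends on the standard library's regex implementation rather than on the regex itself.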

What you can do is:

  • Remove unnecessary whitespace from the data before calling regex_match
  • Split the data in half
  • On Windows, use clang++, which raises the limit to about 5 million characters

In my case, I split the data in half, because the regex [a-zA-Z0-9\\+/=]+ does not need to see the entire input at once.

If anybody knows how we can increase the limit (with some flags or #define), I am interested.
