I am writing a parser for a custom mesh format for a fluid dynamics simulation library. The mesh file contains 3D points (vertices) for the simulation mesh, for example:
[points 4]
2.426492638414711e-07,-0.0454127835577514,0.737590325020352
-0.02408935296003224,-0.02309953378412839,0.7378945938955059
-1.6462459712364876e-07,-0.02312891146336533,0.7381839359073152
0.024084588772487963,-0.02310255971887,0.737895047277951
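The [points N] header is followed by N comma-separated coordinate lines, one per vertex. Conceptually, each parsed line maps to a plain 3-component point of doubles; the exact type in my parser doesn't matter for this question, but roughly:

struct Point
{
    double x;
    double y;
    double z;
};
// e.g. the first line above becomes
// Point{ 2.426492638414711e-07, -0.0454127835577514, 0.737590325020352 }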
My parser works okay, but it's awfully slow. I used a profiler to find the bottlenecks in my code, and I found that the following function alone accounts for 66% of the CPU time.
Token Scanner::ScanNumber()
{
    // We are here because this->unscannedChar contains the start of a number.
    // A number can have different forms: 23, 3.14159265359, 6.0221409e+23, .0001
    std::string ScannedNumber;

    // Add the current unscanned char
    ScannedNumber += this->unscannedChar;
    this->NextChar();

    // This might allow wrong decimal formats to be scanned, e.g. 2.3..4, 1e3e3, 1e6--6e-.
    // But since we depend on std::from_chars to convert the string representation to a real
    // number (sketched below, after the Token definition), and the mesh file is not written
    // by hand, such malformed numbers are very unlikely, so we will stick to this abomination for now.
    while (
        std::isdigit(static_cast<unsigned char>(this->unscannedChar)) || // cast avoids undefined behavior for negative char values
        this->unscannedChar == '.' ||
        this->unscannedChar == 'e' ||
        this->unscannedChar == '-' ||
        this->unscannedChar == '+'   // also accept '+' so exponents like 6.0221409e+23 are not cut short
    )
    {
        ScannedNumber += this->unscannedChar;
        this->NextChar();
    }

    Token token;
    token.type = TokenType::NUMBER;
    token.data = std::move(ScannedNumber);
    return token;
}
And this is the Token definition:
struct Token
{
    TokenType type;
    std::string data;
};
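As mentioned in the comments above, the token's data is later converted to a double with std::from_chars. My real call site is not shown here, but a simplified sketch of that step looks roughly like this:

#include <charconv>

double TokenToDouble(const Token& token)
{
    // Convert the scanned text to a double (requires C++17 <charconv>);
    // error handling of the from_chars result is omitted for brevity.
    double value = 0.0;
    std::from_chars(token.data.data(), token.data.data() + token.data.size(), value);
    return value;
}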
It's worth noting that NextChar() is not a concern at all according to the profiler (it handles around 2 million characters in 439 milliseconds), and strangely the comparison against the 'e' character is taking most of the function's time.
I would appreciate a review of ScanNumber() and any tips to make it faster, since it needs to handle meshes with millions of points.
