C++ Minimal CSV parser

Question

I wrote a minimal CSV parser for my machine learning toy project.

class CSV {
public:
    using data_type = std::variant<int32_t, float, std::string>;
    using row_type = std::vector<data_type>;

private:
    std::vector<row_type> rows;
    std::vector<row_type> columns;
    std::vector<std::string> column_names;
    std::vector<int> column_data_types;
public:
    CSV(std::istream& is) {
        ParseCSV(is);
    }

    int32_t to_int(const std::string& str) {
        int32_t result = 0;
        auto [ptr, ec] = std::from_chars(str.data(), str.data() + str.size(), result);
        return result;
    }

    float to_float(const std::string& str) {
        float result = 0;
        auto [ptr, ec] = std::from_chars(str.data(), str.data() + str.size(), result);
        return result;
    }

    data_type to_data(const std::string& str, std::size_t col_index) {
        switch (column_data_types[col_index]) {
            case 0:
                return to_int(str);
                break;
            case 1:
                return to_float(str);
                break;
            case 2:
                return str;
                break;
            default:
                float result = 0.0f;
                auto [ptr, ec] = std::from_chars(str.data(), str.data() + str.size(), result);
                if (ec == std::errc::invalid_argument) { // this data type is a string
                    column_data_types[col_index] = 2;
                    return str;
                } else if (result == std::floor(result)) { // this may be an integer
                    column_data_types[col_index] = 0;
                    return to_int(str);
                } else {
                    column_data_types[col_index] = 1;
                    return result;
                }
                break;
        }
        assert(0);
        return {};
    }

    void ParseHeader(std::istream& is, std::streamsize& len) {
        constexpr std::streamsize buf_len = 1024;
        std::array<char, buf_len> buffer = {0};
        std::string cell;
        while (len > 0) {
            std::streamsize read_len = std::min(len, buf_len);
            is.getline(&buffer[0], read_len);
            for (int i = 0; i < read_len; ++i) {
                if (buffer[i] == ',') {
                    column_names.push_back(std::move(cell));
                    cell = "";
                } else if (buffer[i] == '\n' || buffer[i] == '\0') {
                    column_names.push_back(std::move(cell));
                    cell = "";
                    len -= i;
                    column_data_types.resize(column_names.size(), -1);
                    return;
                } else {
                    cell += buffer[i];
                }
            }
            len -= read_len;
        }
        column_data_types.resize(column_names.size(), -1);
    }

    void ParseCSV(std::istream& is) {
        is.seekg(0, std::ios::end);
        std::streamsize len = is.tellg();
        is.seekg(0, std::ios::beg);

        auto t1 = std::chrono::steady_clock::now();

        ParseHeader(is, len);

        constexpr std::streamsize buf_len = 1024;
        std::array<char, buf_len> buffer = {0};
        std::string cell;
        row_type row;
        std::size_t col_index = 0;
        while (len > 0) {
            std::streamsize read_len = std::min(len, buf_len);
            is.read(&buffer[0], read_len);
            for (int i = 0; i < read_len; ++i) {
                if (buffer[i] == ',') {
                    row.push_back(to_data(cell, col_index));
                    cell = "";
                    col_index++;
                } else if (buffer[i] == '\n') {
                    row.push_back(to_data(cell, col_index));
                    cell = "";
                    col_index = 0;
                    rows.push_back(std::move(row));
                    row = {};
                } else {
                    cell += buffer[i];
                }
            }
            len -= read_len;
        }
        auto t2 = std::chrono::steady_clock::now();
        auto dt = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1);
        std::cout << rows.size() << " element read in " << dt.count() << " us\n";
    }
};

Reading 20628 lines from https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz gives, in my machine:

20628 element read in 89734 us

Feel free to comment anything!

I'm wondering why you have a 2-D array for rows and for columns. Does it hold the same data twice? — JDługosz
– JDługosz, Commented Sep 27, 2021 at 14:23
Incorporating advice from an answer into the question violates the question-and-answer nature of this site. You could post improved code as a new question, as an answer, or as a link to an external site - as described in I improved my code based on the reviews. What next?. — Toby Speight
– Toby Speight, Commented Sep 28, 2021 at 8:52
Please do not edit the question, especially the code, after an answer has been posted. Changing the question may cause answer invalidation. Everyone needs to be able to see what the reviewer was referring to. What to do after the question has been answered. — pacmaninbw
– pacmaninbw ♦, Commented Sep 28, 2021 at 12:28

JDługosz · Accepted Answer · 2021-09-27 15:39:03Z

Your primitive conversion functions don't report errors. Since your data type is a variant anyway, you could include an error type among them and return an error.

Use of from_chars is good! These are new, efficient functions.

to_data uses hard-coded numbers in the switch statement and elsewhere. It would be better to use an enumeration type.

You don't need to break after return. The return itself is enough!

Why are you using fixed-size buffers for reading the lines?

cell = ""; use {} for an empty string. Converting "" is inefficient because it uses the general string literal constructor! But in this case, just use cell.clear();.

The code in ParseHeader is pretty slow and laborious. You're reading one character at a time, using a loop over i the index for that matter (use a range-based for loop to do it directly), and pushing the character onto a string until you see the comma. I scan for the comma or separators using standard algorithms and then grab the whole token at once.

I see ParseCSV is doing the same thing. You don't need to copy the chars into another string, one-at-a-time or otherwise! Find the delimiter, and then you have two pointers to the start and end of the token. That's what from_chars wants!

It looks like you determine the cell type when reading the first row. That might give false results, if the first value is "1" but it actually supports floating point so the next row has "2.4". Or, it's not an integer but actually a string, as the second row has "B".

Here's my code that splits a string up at the commas. It needs to know the number of fields at compile-time and thus avoids any dynamic memory allocation, but look at the body of the for loop for the actual logic; you can adapt that to whatever collections you are using.

template <size_t N>
auto split (char separator, std::string_view input)
{
    std::array<std::string_view, N> results; 
    auto current = input.begin();
    const auto End = input.end();
    for (auto& part : results)
    {
        if (current == End) {
            const bool is_last_part = &part == &(results.back());
            if (!is_last_part)  throw std::invalid_argument ("not enough parts to split");
        }
        //  🔻🔻🔻 core logic is here
        auto delim = std::find (current, End, separator);
        part = { &*current, size_t(delim-current) };
        current = delim;
        //  🔺🔺🔺
        if (delim != End) ++current;
    }
    if (current != End)  throw std::invalid_argument ("too many parts to split");
    return results;
}

architecture

You code has:

reading from a file
separating at delimiters
converting the tokens to the desired type
storing those converted tokens

all mixed together in one piece of code. You may notice from my sample that it only does point 2. Where that line of text came from is irrelevant, and subsequent points are not its responsibility.

Input doesn't have to come from a file or even a istream object: it can scan a memory-mapped file that's already visible in RAM, or lines already in an array that came across via some message protocol, or literal constants, or anything. The code does not need to be configured for this because it simply does not matter to what this code does, so none of that is part of it. It takes a string_view to process one line.

I don't store the results in a two-dimensional array of variants, either. I assign a row directly to a normal C++ structure that has named members with full type information. Handling the conversion of strings to the proper types is another, general purpose, piece of code that's unrelated to the task of separating CSV fields. Adding a user-defined type or other previously unused type does not require changing this code, because conversions is not part of it.

Separations of Concerns is good. Even if you don't need flexibility, it makes each task simple and testable in isolation.

Can you benchmark your solution? I tried something similar after reading in a whole line, but it was less efficient than OP's solution. — G. Sliepen
– G. Sliepen, Commented Sep 27, 2021 at 15:10
@G.Sliepen Yes it's fast. That's because the separator is a single char, not a string or choice of characters, and because it doesn't allocate any memory on the heap nor copies the strings it finds. I didn't count the time needed to read a line from the file; that is a separate concern and this can be called with a whole file that's been slurped in and "separated" at newline, or an array of strings given to my code already in that form. — JDługosz
– JDługosz, Commented Sep 27, 2021 at 15:19
@JDługosz Thanks for your feedback, but when I've tried your solution (using std::string_view), it turned out to be slower than the original solution for the reason I don't know. I'll post my two implementations and benchmark result — frozenca
– frozenca, Commented Sep 28, 2021 at 2:15
@G.Sliepen (and JDlugosz) Wait, I've benchmarked wrongly. The std::string_view approach definitely improved performance hugely. Posted the result — frozenca
– frozenca, Commented Sep 28, 2021 at 2:28

G. Sliepen · Accepted Answer · 2021-09-27 15:08:04Z

It's surprisingly efficient. If I were to implement it myself, I would have gone with a while(std::getline(...)) loop, so I always know I am dealing with a whole line at a time, and can pass a std::string_view into the line buffer instead of making a copy of each item into cell. But this was more than twice as slow as your solution, even if I reduce buf_len to just 16. There are a few improvements that could be made that I'll list below, but nothing really impacted performance.

Use an `enum class` for the possible types

Instead of using raw integers to represent the type, create an enum class for it:

enum class type {
    UNKNOWN,
    INT,
    FLOAT,
    STRING,
};

std::vector<type> column_data_types;

This avoids magic constants and having to remember the index into data_type for a given type.

Remove unused variables

The member variable columns is unused, as well as the local variables ptr and ec in to_int() and to_float(). Just remove them.

Passing string iterators

Instead of calling std::from_chars() with str.data() and str.data() + str.size() as the parameters, just pass str.begin() and str.end().

Avoid unnecessary copies

Inside ParseCSV() you build a row in the temporary variable row, and then std::move() that into rows. While this is rather efficient since std::vector has a move constructor, you can avoid this altogether by adding an empty row to rows first, and then add cells to that row:

row.emplace_back(); // Ensure there is an empty row
...
while (len > 0) {
    ...
    if (buffer[i] == ',') {
        rows.back().emplace_back(to_data(cell, col_index));
        ...
    } else if (buffer[i] == '\n') {
        rows.back().emplace_back(to_data(cell, col_index));
        rows.emplace_back(); // This row is done, add a new one
        ...
    } else ...

Consider making it more generic

It would be possible to reduce the amount of code you have to write somewhat by making more use of the type system. Instead of column_data_types being just a vector of indices into data_type, just make it a vector of data_type itself:

std::vector<data_type> column_data_types;

Then, instead of having to_data() having a switch-statement to call the right to_something() function, create a template to<T>() with overloads for the types you want to support, and to_data() use std::visit() to call the right overload:

template<typename T>
T to(const std::string &str) = delete;

template<>
int to<int>(const std::string &str) {
    int32_t result = 0;
    std::from_chars(str.begin(), str.end(), result);
    return result;
}

...

data_type to_data(const std::string &str, std::size_t col_index) {
    if (std::holds_alternative(column_data_types[col_index])) {
        return std::visit([&](auto& type) {
             return to<decltype(type)>(str);
        }, column_data_types[col_index]);
    }

    // Else check the string to determine what type it is
    ...
    if (/* it's an int */) {
        column_data_types[col_index] = int{};
        return to<int>(str);
    }
    ...       
}

If you just support three types it's not worth it, but it might pay off when you want to support more types.

Stack Exchange Network

C++ Minimal CSV parser

2 Answers 2

architecture

Use an `enum class` for the possible types

Remove unused variables

Passing string iterators

Avoid unnecessary copies

Consider making it more generic

You must log in to answer this question.

Linked

Hot Network Questions

C++ Minimal CSV parser

2 Answers 2

architecture

Use an enum class for the possible types

Remove unused variables

Passing string iterators

Avoid unnecessary copies

Consider making it more generic

You must log in to answer this question.

Linked

Related

Hot Network Questions

Use an `enum class` for the possible types