
I am working on a C++11 application that is supposed to ship as a single executable binary. Optionally, users can provide their own CSV data files for the application to use. To simplify things, assume each record has the format key,value\n. I have created a structure like this:

#include <string>

struct Data {
    std::string key;
    std::string value;

    Data(std::string key, std::string value) : key(key), value(value) {}
};

By default, the application should use data defined in a single header file. I've written a simple Python script that parses the default CSV file and emits a header file like this:

#ifndef MYPROJECT_DEFAULTDATA
#define MYPROJECT_DEFAULTDATA

#include "../database/DefaultData.h"

namespace defaults {
    std::vector<Data> default_data = {
        Data("SomeKeyA","SomeValueA"),
        Data("SomeKeyB","SomeValueB"),
        Data("SomeKeyC","SomeValueC"),

        /* and on, and on, and on... */

        Data("SomeKeyASFHOIEGEWG","SomeValueASFHOIEGEWG")
    };
}

#endif //MYPROJECT_DEFAULTDATA

The only problem is that the file is big: 116,087 lines (12M), and it will probably be replaced with an even bigger file in the future. When I include it, my IDE tries to parse it and update its indices, which slows everything down to the point where I can hardly type.

I'm looking for a way to either:

  1. prevent my IDE (CLion) from parsing it or
  2. add a CMake switch so that this file is used only for release executables, or
  3. somehow inject the data directly into the executable

  • And you can't ship the actual "default value" CSV file along with the executable, so it can be read if no other data file is loaded? Then how about creating a single long string containing the actual contents of the file, including it much like your vector, and parsing that string at startup? Commented Nov 2, 2016 at 15:04
  • So you want to give your users a single executable, but you need to do a different build per user to incorporate their data (after they have given you the data file)? What problem are you really trying to solve? It is usual to store data in a data file or database, for a reason. Commented Nov 2, 2016 at 15:07
  • Regarding "performance", is the startup time that important? Or is it more important that it "performs" once it's started? How often will the program be started? Several times a day? Once a day? Once a week? How long will it run once started? A few minutes? Hours? Days? I'm asking because every other method I know of to "embed" the file into the executable relies on storing the raw, unparsed data. Maybe your generated data source file should be built into a separate library, externally, and without the IDE really touching it? Commented Nov 2, 2016 at 15:20
  • Managing the external "library" is (relatively) easy using the add_custom_command and add_custom_target commands. You don't really need a library, just an object file that you add to your main build. As long as you don't open the auto-generated source file, and you put it in a separate directory that is excluded from CLion, you should not have a problem. Commented Nov 2, 2016 at 15:43
  • @Someprogrammerdude: I didn't say that vector would allocate more than once. Each individual std::string within that vector may have its own allocation, depending on the sizes of the strings and the presence of SSO in that std::string implementation. Plus, each string construction will also copy out of the string literal. Commented Nov 2, 2016 at 16:01

1 Answer

Since your build process already includes a preprocessing step that generates C++ code from a CSV file, this should be easy.

Step 1: Put most of the generated data in the .cpp file, not a header.

Step 2: Generate your code so that it doesn't use vector or string.

Here's how to do these:

struct Data
{
    string_view key;
    string_view value;
};

You will need an implementation of string_view or a similar type. Although std::string_view was only standardized in C++17, implementing such a type doesn't rely on any C++17 language features, so a C++11 version works fine.
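As a rough illustration, a minimal C++11 stand-in could look like the sketch below. The names StringRef, str() and slice() are hypothetical and not part of any library; a real project might prefer an existing string_view backport instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C++11 stand-in for string_view: a non-owning
// pointer-plus-length pair. It never allocates and never copies characters.
struct StringRef {
    const char* data;
    std::size_t size;

    // Copy into an owning std::string only when one is really needed.
    std::string str() const { return std::string(data, size); }
};

// Two views are equal when they have the same length and the same bytes.
inline bool operator==(StringRef a, StringRef b) {
    return a.size == b.size &&
           (a.size == 0 || std::memcmp(a.data, b.data, a.size) == 0);
}

// Hypothetical helper: a view of `len` characters starting at `offset`
// inside a NUL-terminated block of character data.
inline StringRef slice(const char* block, std::size_t offset, std::size_t len) {
    assert(offset + len <= std::strlen(block));
    return StringRef{block + offset, len};
}

struct Data {
    StringRef key;
    StringRef value;
};
```

Because StringRef and Data are plain aggregates with no constructors, a table of them can be brace-initialized from string-literal addresses without running any code at startup.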

As for the data structure itself, this is what gets generated in the header:

namespace defaults {
    extern const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data;
}

{{GENERATED_ARRAY_COUNT}} is the number of items in the array. That's all the generated header should expose. The generated .cpp file is a bit more complex:

static const char ptr[] =
    "SomeKeyA" "SomeValueA"
    "SomeKeyB" "SomeValueB"
    "SomeKeyC" "SomeValueC"
    ...
    "SomeKeyASFHOIEGEWG" "SomeValueASFHOIEGEWG"
;

namespace defaults
{
  // std::array wraps a built-in array, so a table of nested aggregates
  // needs an extra pair of outer braces around the element list.
  const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data =
  {{
      {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
      {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
      ...
      {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
  }};
}

ptr is a single string that is the concatenation of all of your individual strings. There is no need to put spaces, \0 characters, or anything else between the individual strings. However, if you need to pass these strings to APIs that take NUL-terminated strings, you'll either have to copy them into a std::string or have the generator emit a \0 character after each generated substring.
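The NUL-terminated variant might look like the sketch below. The names block and c_str_at are made up for illustration, and the data is just the first two sample records. Note that every offset now has to count the extra terminator bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Each substring is followed by an explicit "\0", so `block + offset`
// is a valid C string on its own.
static const char block[] =
    "SomeKeyA\0" "SomeValueA\0"
    "SomeKeyB\0" "SomeValueB\0";

// Hypothetical accessor: the C string that starts at `offset` in the block.
// Offsets for the data above are 0, 9, 20 and 29.
inline const char* c_str_at(std::size_t offset) {
    assert(offset < sizeof(block));
    return block + offset;
}
```

With this layout, a call such as std::strlen(c_str_at(9)) stops at the end of "SomeValueA" instead of running on into the next key.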

The point is that ptr should be a single, giant block of character data.

{{GENERATED_OFFSET}} and {{GENERATED_SIZE}} are the offset and the length of one substring within the giant block of character data.
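To make the placeholders concrete, here is a sketch of what the generated output could look like for just the first three sample records. StringRef is a hypothetical stand-in for string_view, and value_at is an illustrative helper, not something the generator has to produce.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for string_view (pointer + length, non-owning).
struct StringRef {
    const char* data;
    std::size_t size;
};

struct Data {
    StringRef key;
    StringRef value;
};

// What the generated header would declare.
namespace defaults {
    extern const std::array<Data, 3> default_data;
}

// What the generated .cpp would contain: one block of characters...
static const char ptr[] =
    "SomeKeyA" "SomeValueA"
    "SomeKeyB" "SomeValueB"
    "SomeKeyC" "SomeValueC";

// ...and one {pointer, size} pair per key and per value.
namespace defaults {
    const std::array<Data, 3> default_data = {{
        {{ptr + 0,  8}, {ptr + 8,  10}},
        {{ptr + 18, 8}, {ptr + 26, 10}},
        {{ptr + 36, 8}, {ptr + 44, 10}},
    }};
}

// Illustrative helper: copies the value at index i into a std::string.
inline std::string value_at(std::size_t i) {
    assert(i < defaults::default_data.size());
    const StringRef& v = defaults::default_data[i].value;
    return std::string(v.data, v.size);
}
```

Each offset is simply the previous offset plus the previous size: 0 + 8 = 8, 8 + 10 = 18, and so on.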

This method will solve two of your problems. It will be much faster at load time, since it performs zero dynamic allocations. And it puts the generated strings in the .cpp file, so the IDE only has to index the small header wherever you include it.
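The bookkeeping the generator has to do is a running sum over the string lengths. The question's generator is a Python script; the sketch below shows the same computation in C++ only to keep the examples in one language. Span, Row, Layout, append and build_layout are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Where one substring lives inside the concatenated block.
struct Span {
    std::size_t offset;
    std::size_t size;
};

// One generated table row: the spans of a key and of its value.
struct Row {
    Span key;
    Span value;
};

struct Layout {
    std::string blob;       // concatenation of every key and value
    std::vector<Row> rows;  // one entry per CSV record
};

// Appends s to the blob and reports where it landed.
inline Span append(std::string& blob, const std::string& s) {
    Span span = {blob.size(), s.size()};
    blob += s;
    return span;
}

// Computes the blob and the offset/size pairs for already-parsed records.
inline Layout build_layout(
        const std::vector<std::pair<std::string, std::string>>& records) {
    Layout layout;
    layout.rows.reserve(records.size());
    for (const auto& rec : records) {
        Row row;
        row.key = append(layout.blob, rec.first);
        row.value = append(layout.blob, rec.second);
        layout.rows.push_back(row);
    }
    // Invariant: the last value ends exactly at the end of the blob.
    assert(layout.rows.empty() ||
           layout.rows.back().value.offset + layout.rows.back().value.size ==
               layout.blob.size());
    return layout;
}
```

A generator would then print blob as the string literal and each row as one line of the array initializer.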
