malloc once, then distribute memory over struct arrays

Question

I have a struct that has the following memory layout:

uint32_t  
variable length array of type uint16_t
variable length array of type uint16_t

Because of the variable length of the arrays I have pointers to these arrays, effectively:

struct struct1 {
  uint32_t n;
  uint16_t *array1;
  uint16_t *array2;
};
typedef struct struct1 struct1;

Now, when allocating these structs I see two options:

A) malloc the struct itself, then malloc space for the arrays individually and set the pointers in the struct to point to the correct memory location:

uint32_t n1 = 10;
uint32_t n2 = 20;

struct1 *s1 = malloc(sizeof(struct1));
uint16_t *array1 = malloc(sizeof(uint16_t) * n1);
uint16_t *array2 = malloc(sizeof(uint16_t) * n2);
s1->n = n1;
s1->array1 = array1;
s1->array2 = array2;

B) malloc memory for everything combined, then "distribute" the memory over the struct:

struct1 *s1 = malloc(sizeof(struct1) + (n1 + n2) * sizeof(uint16_t));
s1->n = n1;
s1->array1 = (uint16_t *)(s1 + 1);   /* first byte after the struct; s1 + k advances k structs, not k bytes */
s1->array2 = s1->array1 + n1;        /* n1 elements past the start of array1 */

Note that array1 and array2 are not bigger than a few KB and usually not a lot of struct1s are needed. However, cache efficiency is a concern as numeric data crunching is done with this struct.

Is approach B) possible and if so better (faster) than A in terms of memory locality?
I am not very familiar with C. Is there a better way of doing B (or A), i.e. using memcpy or realloc or something?
Anything else to be mindful about in this situation?

Note that right now I'm using gcc (C89?) on Linux but could use C99/C11 if necessary. Thanks in advance.

EDIT: To clarify further: The size of the arrays will never change after creation. Multiple struct1s will not always be allocated at once but rather occasionally during the program's runtime.

What you are considering is related to the "struct hack" or, in C99, a "flexible array member". These only allow for one variable-length array. They also automatically deal with the alignment requirements, which you may have overlooked.
– Avi Berger, Commented Nov 25, 2016 at 17:19

Greg Schmit · Accepted Answer · 2016-11-25 20:55:32Z

I think your option A is much cleaner and would scale in a more sensible way. Imagine having to realloc space when the array in one of the structures becomes larger: in option A, you can realloc that memory since it isn't logically attached to anything else. In option B, you need to add in additional logic to ensure you don't break your other array.
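As a rough illustration of the realloc point for option A, here is a minimal sketch. The helper name struct1_resize_array2 is hypothetical, and it assumes array2 was obtained from malloc on its own, as in option A.

```c
#include <stdint.h>
#include <stdlib.h>

struct struct1 {
    uint32_t  n;
    uint16_t *array1;
    uint16_t *array2;
};

/* Hypothetical helper: grow or shrink array2 without touching array1.
 * Returns 0 on success, -1 on failure (the old array stays valid). */
static int struct1_resize_array2(struct struct1 *s, size_t new_len)
{
    uint16_t *tmp;

    if (new_len == 0)
        return -1;               /* avoid the murky realloc(ptr, 0) case */

    tmp = realloc(s->array2, new_len * sizeof *tmp);
    if (!tmp)
        return -1;               /* s->array2 is unchanged on failure */

    s->array2 = tmp;             /* existing elements are preserved */
    return 0;
}
```

With option B the same operation would have to reallocate the whole block and recompute both pointers, because realloc may move the block.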

I also think (even in C89, but I could be wrong) that there is nothing wrong with this:

struct1 *s1 = malloc(sizeof(struct1));
s1->array1 = malloc(sizeof(uint16_t) * n1);
s1->array2 = malloc(sizeof(uint16_t) * n2);
s1->n = n1;

The above takes out the middle-man arrays. I think it is cleaner because you immediately see that you are allocating space for a pointer in a structure.

I have used your option B before for 2D arrays, where I just allocate a single space and use logical rules in my code to use it as a 2D space. This is useful when I want it to be a rectangular 2D space, so when I increase it, I always increase each row or column. In other words, I never want to have heterogeneous array sizes.
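A minimal sketch of that single-allocation 2D layout might look like the following. The names struct grid, grid_new and grid_at are hypothetical, and the flexible array member assumes C99 or later.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: a rows x cols grid stored in one malloc'd block. */
struct grid {
    size_t   rows;
    size_t   cols;
    uint16_t cells[];   /* C99 flexible array member, row-major order */
};

/* Allocate the header and all cells together; returns NULL on failure. */
static struct grid *grid_new(size_t rows, size_t cols)
{
    struct grid *g;

    /* Guard against overflow in rows * cols * sizeof(uint16_t). */
    if (cols != 0 && rows > (SIZE_MAX - sizeof *g) / sizeof(uint16_t) / cols)
        return NULL;

    g = calloc(1, sizeof *g + rows * cols * sizeof(uint16_t));
    if (!g)
        return NULL;
    g->rows = rows;
    g->cols = cols;
    return g;
}

/* The "logical rule": element (r, c) lives at index r * cols + c. */
static uint16_t *grid_at(struct grid *g, size_t r, size_t c)
{
    return &g->cells[r * g->cols + c];
}
```

Every row has the same length here, which is exactly why this layout suits rectangular data and not arrays of differing sizes.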

Update: 'Arrays will never change in size'

Because you clarified that your structures/arrays will never need to be reallocated, I think option B is less bad. It still seems to be a worse solution for this application than option A, and here are my reasons for thinking this:

1. malloc is optimized such that there wouldn't be much gain from allocating a single space compared to allocating the spaces individually.
2. The ability of other engineers to look at and immediately understand your code would be reduced. To be clear, any competent software engineer should be able to look at option B and figure out what the writer of the code was doing, but it very well could waste that engineer's brain-cycles and could cause a junior engineer to misunderstand the code and create a bug.

So, if you comment the code thoroughly, and your application absolutely requires you to optimize everything you possibly can, at the expense of clean and logically sensible code (where memory space and data structures are logically separated in a similar way), and you know that this optimization is better than what a good compiler (like Clang) can do, then option B could be a better option.

Update: Testing

In the spirit of self-criticism I wanted to see if I could evaluate the difference. So I wrote two programs (one for option A and one for option B) and compiled them with optimizations off. I used a FreeBSD virtual machine to get as clean an environment as possible, and I used gcc.

Here are the programs that I used to test the two methods:

optionA.c:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NSIZE   100000
#define NTESTS  10000000

struct test_struct {
    int n;
    int *array1;
    int *array2;
};

void freeA(struct test_struct *input) {
    free(input->array1);
    free(input->array2);
    free(input);
    return;
}

void optionA() {
    struct test_struct *s1 = malloc(sizeof(*s1));
    s1->array1 = malloc(sizeof(*(s1->array1)) * NSIZE);
    s1->array2 = malloc(sizeof(*(s1->array1)) * NSIZE);
    s1->n = NSIZE;
    freeA(s1);
    s1 = 0;
    return;
}

int main() {
    clock_t beginA = clock();
    int i;
    for (i=0; i<NTESTS; i++) {
        optionA();
    }
    clock_t endA = clock();
    int time_spent_A = (endA - beginA);
    printf("Time spent for option A: %d\n", time_spent_A);
    return 0;
}

optionB.c:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NSIZE   100000
#define NTESTS  10000000

struct test_struct {
    int n;
    int *array1;
    int *array2;
};

void freeB(struct test_struct *input) {
    free(input);
    return;
}

void optionB() {
    struct test_struct *s1 = malloc(sizeof(*s1) + 2*NSIZE*sizeof(*(s1->array1)));
    /* s1 + 1 is the first byte past the struct; pointer arithmetic on s1 counts in structs, not bytes */
    s1->array1 = (int *)(s1 + 1);
    s1->array2 = s1->array1 + NSIZE;
    s1->n = NSIZE;
    freeB(s1);
    s1 = 0;
    return;
}

int main() {
    clock_t beginB = clock();
    int i;
    for (i=0; i<NTESTS; i++) {
        optionB();
    }
    clock_t endB = clock();
    int time_spent_B = (endB - beginB);
    printf("Time spent for option B: %d\n", time_spent_B);
    return 0;
}

Results for these tests are given in clocks (see clock(3) for more information).

 Series | Option A | Option B
------------------------------
 1      | 332      | 158
------------------------------
 2      | 334      | 155
------------------------------
 3      | 334      | 156
------------------------------
 4      | 333      | 154
------------------------------
 5      | 339      | 156
------------------------------
 6      | 334      | 155
------------------------------
 avg    | 334.3    | 155.7
------------------------------

Note that these speeds are still incredibly fast and translate to milliseconds over millions of tests. I have also found that Clang (cc) is better than gcc at optimizing. On my machine, even after writing a method that writes data to the arrays (to ensure they don't get optimized out of existence), I got no difference between the two methods when compiling with cc.
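The values returned by clock() are in implementation-defined ticks, so raw differences are hard to compare across systems. A small sketch of converting them to seconds with the standard CLOCKS_PER_SEC macro follows; the helper name clocks_to_seconds is hypothetical.

```c
#include <time.h>

/* Convert a clock() interval to seconds of processor time.
 * CLOCKS_PER_SEC is defined by <time.h> and differs between systems. */
static double clocks_to_seconds(clock_t begin, clock_t end)
{
    return (double)(end - begin) / CLOCKS_PER_SEC;
}
```

In the test programs this would be called as clocks_to_seconds(beginA, endA) and printed with %f.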

Wow, thank you very much for your effort! Given that I probably won't allocate new structs very frequently, I will use A until profiling convinces me otherwise.

Paul Ogilvie · Accepted Answer · 2016-11-25 17:08:20Z


I would advise a hybrid of the two:

1. allocate the structs in one call (it is now an array of structs);
2. allocate the arrays in one call, and make sure the size includes any padding for the alignment required by your compiler/platform;
3. distribute the arrays over the structs, taking the alignment into account.

However, malloc is already optimized, so your first solution would still be preferred.

Note: as Greg Schmit's answer points out, allocating all the arrays at once will cause difficulties if an array's size needs to change at run time.
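The three steps could be sketched roughly as below. The names struct pool, pool_init and pool_free are hypothetical, and the sketch assumes every struct gets arrays of the same two lengths and that count and n1 + n2 are nonzero. Because all arrays hold uint16_t, no padding is needed between them; arrays of mixed types would need the alignment padding mentioned in step 2.

```c
#include <stdint.h>
#include <stdlib.h>

struct struct1 {
    uint32_t  n;
    uint16_t *array1;
    uint16_t *array2;
};

/* Hypothetical container: one block of structs plus one block of array data. */
struct pool {
    struct struct1 *items;   /* count structs, allocated in one call */
    uint16_t       *data;    /* all arrays, allocated in one call */
    size_t          count;
};

/* Returns 0 on success, -1 on allocation failure.
 * Size overflow is not checked in this sketch. */
static int pool_init(struct pool *p, size_t count, uint32_t n1, uint32_t n2)
{
    size_t per_item = (size_t)n1 + n2;
    size_t i;

    p->items = malloc(count * sizeof *p->items);
    p->data  = malloc(count * per_item * sizeof *p->data);
    if (!p->items || !p->data) {
        free(p->items);
        free(p->data);
        return -1;
    }
    p->count = count;

    /* Distribute slices of the data block over the structs. */
    for (i = 0; i < count; i++) {
        p->items[i].n      = n1;
        p->items[i].array1 = p->data + i * per_item;
        p->items[i].array2 = p->items[i].array1 + n1;
    }
    return 0;
}

static void pool_free(struct pool *p)
{
    free(p->items);
    free(p->data);
}
```

Individual structs cannot be freed on their own in this layout; the whole pool is released together.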


1 Comment

elfeck Over a year ago

Thanks for 3. about alignment. Even if not applicable here, it's good to keep it in mind.

Nominal Animal · Accepted Answer · 2016-11-25 18:50:19Z


Because the two arrays have the same type, there are many more options than that, based on creative use of the C99 flexible array member. I'd recommend you use a pointer only for the second array,

struct foo {
    uint16_t *array2;
    uint32_t  field;
    uint16_t  array1[];
};

and allocate memory for both at the same time:

struct foo *foo_new(const size_t length1, const size_t length2)
{
    struct foo *result;

    result = malloc( sizeof (struct foo)
                   + length1 * sizeof (uint16_t)
                   + length2 * sizeof (uint16_t) );
    if (!result)
        return NULL;

    result->array2 = result->array1 + length1;

    return result;
}

Note that with struct foo *bar, accessing element i in the two arrays uses the same notation, bar->array1[i] and bar->array2[i], respectively.
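A short usage sketch may make this concrete. The struct and foo_new are repeated from above so the block compiles on its own; foo_demo is a hypothetical caller added for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

struct foo {
    uint16_t *array2;
    uint32_t  field;
    uint16_t  array1[];
};

/* Same allocation as foo_new() above, repeated so this sketch compiles alone. */
static struct foo *foo_new(const size_t length1, const size_t length2)
{
    struct foo *result = malloc(sizeof (struct foo)
                                + (length1 + length2) * sizeof (uint16_t));
    if (!result)
        return NULL;
    result->array2 = result->array1 + length1;
    return result;
}

/* Hypothetical usage: fill both arrays, sum them, release with one free(). */
static uint32_t foo_demo(const size_t length1, const size_t length2)
{
    struct foo *bar = foo_new(length1, length2);
    uint32_t    sum = 0;
    size_t      i;

    if (!bar)
        return 0;

    for (i = 0; i < length1; i++)
        bar->array1[i] = 1;          /* flexible array member */
    for (i = 0; i < length2; i++)
        bar->array2[i] = 2;          /* pointer into the same block */

    for (i = 0; i < length1; i++)
        sum += bar->array1[i];
    for (i = 0; i < length2; i++)
        sum += bar->array2[i];

    free(bar);                       /* frees the struct and both arrays */
    return sum;
}
```

Note that struct foo does not record the array lengths itself, so the caller has to keep track of them.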

In the context of scientific computing, I would consider entirely different options, depending on the access patterns. For example, if the two arrays are accessed in lockstep (in any direction), I would use

typedef  uint16_t  pair16[2];

struct bar {
    uint32_t  field;
    pair16    array[];
};

If the arrays were large, then copying them into temporary buffers (arrays of pair16 above, if accessed in lockstep) would possibly help, but with at most a few thousand entries, it is likely not going to give a significant speed boost.
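For illustration, allocating and traversing the interleaved layout could look roughly like this. bar_new and bar_dot are hypothetical helpers, and storing the pair count in field is an assumption of this sketch, not something stated above.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t pair16[2];

struct bar {
    uint32_t field;
    pair16   array[];
};

/* Hypothetical constructor: count zero-initialized pairs.
 * Assumption: field is used to remember the number of pairs. */
static struct bar *bar_new(uint32_t count)
{
    struct bar *b = calloc(1, sizeof *b + (size_t)count * sizeof (pair16));
    if (b)
        b->field = count;
    return b;
}

/* Lockstep access: both values for index i sit next to each other in memory. */
static uint32_t bar_dot(const struct bar *b)
{
    uint32_t sum = 0;
    uint32_t i;

    for (i = 0; i < b->field; i++)
        sum += (uint32_t)b->array[i][0] * b->array[i][1];
    return sum;
}
```

Here array[i][0] plays the role of array1[i] and array[i][1] the role of array2[i], which only works when both arrays have the same length.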

In cases where the access pattern depends on the other, but you still do enough of computation on each entry, it may be useful to compute the address of the next entry early, and use __builtin_prefetch() GCC built-in to tell the CPU you'll need it soon, before doing the computation on the current entry. It may reduce the data access latencies (although the access predictors are pretty darn good on current processors already).
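As a hedged sketch of that idea, the loop below gathers values through an index array and prefetches the next entry before working on the current one. The function name gather_sum and the index-array setup are assumptions made for illustration; __builtin_prefetch() is a GCC built-in (also accepted by Clang), and whether it helps at all depends on the hardware and the generated code.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum data[idx[0]], data[idx[1]], ...: the access pattern comes from idx.
 * The prefetch is only a hint and never changes the result. */
static uint32_t gather_sum(const uint16_t *data, const uint32_t *idx, size_t count)
{
    uint32_t sum = 0;
    size_t   i;

    for (i = 0; i < count; i++) {
        /* Work out the next address early and request it ahead of time. */
        if (i + 1 < count)
            __builtin_prefetch(&data[idx[i + 1]]);

        sum += data[idx[i]];   /* stand-in for the real computation */
    }
    return sum;
}
```

With so little work per entry the hint is unlikely to pay off; it is meant for loops where the per-entry computation is long enough to hide the memory latency.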

With GCC (and to a lesser extent on other common compilers like Intel Compiler Collection, Portland Group, and Pathscale C compilers), I've noticed that code that manipulates pointers (instead of array pointers and array indexing) compiles to better machine code on x86 and x86-64. (The reason is actually quite simple: with array pointers and array indexing, you need at least two separate registers, and x86 has relatively few of those. Even x86-64 doesn't have that many of them. GCC in particular is not very strong at optimizing register usage -- it's much better now than in the version 3 era --, so this seems to help a lot in some cases). For example, if you were to access the first array in a struct foo sequentially, then

void do_something(struct foo *ref, const size_t length1)
{
    /* length1: number of elements in array1, supplied by the caller */
    uint16_t       *array1 = ref->array1;
    uint16_t *const limit1 = ref->array1 + length1;

    for (; array1 < limit1; array1++) {

        /* ... */

    }
}


2 Comments

Greg Schmit Over a year ago

That was smart mentioning __builtin_prefetch(), which I didn't consider (for optimizing the actual calculation part after you have the memory space).

Nominal Animal Over a year ago

@GregSchmit: I've had mixed results with it, though. (I mean, it is not easy to implement an algorithm, and manage to sprinkle them optimally; it is a bit of a try-and-see, because it depends so much on compiler-generated code.) I never worry about malloc() et al. overheads; cache effects and access patterns are so much larger in any real computation tasks. Even optimizing say the calculation of squared distances between points in 3D is useless, unless you get their data arranged optimally: the memory access latencies will be so bad otherwise.

Mike Nakis · Accepted Answer · 2016-11-25 17:08:28Z

Approach B is possible (why don't you just try it?), and it is better, not so much because of memory locality but because each malloc() call has a cost, so the fewer times you call it, the better off you are. (Assuming that 'better' means 'faster', which, admittedly, is not necessarily the case.)

Memory locality is only marginally improved, since all memory blocks would most likely be contiguous (one after the other) in memory, so if you went with approach A your arrays would only be separated by block headers, which are not very big. (Of the order of 32 bytes each, maybe a bit larger, but not by much.) The only situation in which your blocks would not be contiguous is if you had previously been doing malloc() and free(), so your memory would be fragmented.
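One way to check this on a particular system is to print how far apart the separately allocated blocks land. The sketch below is only an experiment: the helper names gap_bytes and report_layout are hypothetical, and the numbers depend entirely on the allocator and on what was allocated and freed beforehand.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct struct1 {
    uint32_t  n;
    uint16_t *array1;
    uint16_t *array2;
};

/* Signed distance in bytes from a to b, computed on integer addresses
 * because subtracting pointers to unrelated objects is undefined in C. */
static long long gap_bytes(const void *a, const void *b)
{
    return (long long)((intptr_t)b - (intptr_t)a);
}

/* Approach A layout: report where three separate blocks end up. */
static void report_layout(uint32_t n1, uint32_t n2)
{
    struct struct1 *s  = malloc(sizeof *s);
    uint16_t       *a1 = malloc(n1 * sizeof *a1);
    uint16_t       *a2 = malloc(n2 * sizeof *a2);

    if (s && a1 && a2) {
        printf("struct -> array1: %lld bytes\n", gap_bytes(s, a1));
        printf("array1 -> array2: %lld bytes\n", gap_bytes(a1, a2));
    }
    free(a2);
    free(a1);
    free(s);
}
```

Subtracting the size of each block from the reported gap gives a rough idea of the per-block overhead on that allocator.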
