
OVERVIEW

I have a program that needs to have shared state between several processes (probably 80, I'm working on an embarrassingly parallel problem on a server with 80 cores). Ideally, I would be able to allocate this shared memory in such a way that I could expand the amount of shared state between these processes.

My suspicion is that it's failing because the pointers don't point to actual memory, so if the return value of mmap in one process is 0xDEADBEEF, that doesn't mean that 0xDEADBEEF will point to the same section of memory in another process. However, I know next to nothing about C programming, so that suspicion could easily be wrong.

Could anyone tell me if my suspicion is correct? If so, what should I be doing for shared state instead? The server would take at least 18 days per dataset without using all the cores, and we have quite a few datasets, so giving up on parallel computing is not really an option. However, I am willing to switch from processes to threads or something similar if that would help (I don't know how one would do that in C though). Thanks in advance for your help.

Below is some working and non-working sample code, and the results from gdb.

borked.h

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Expandable array, doubles memory allocation when it runs out of space.
struct earr {
    int *vals;
    int capacity;
    int length;
};

void *shared_calloc(size_t nmemb, size_t size) {
    void *mem = mmap(NULL, nmemb * size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if(mem == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }
    /* MAP_ANONYMOUS pages are already zero-filled; the memset just
     * makes the calloc-like semantics explicit. */
    return memset(mem, 0, nmemb * size);
}

void shared_free_size(void *mem, size_t nmemb, size_t size) {
    if(mem) munmap(mem, nmemb * size);
}

struct earr *create_earr() {
    struct earr *a = shared_calloc(1, sizeof(struct earr));
    a->length = 0;
    a->capacity = 16;
    a->vals = shared_calloc(a->capacity, sizeof(int));
    return a;
}

void earr_expand(struct earr *a) {
    int *new_vals = shared_calloc(a->capacity * 2, sizeof(int));
    memcpy(new_vals, a->vals, a->capacity * sizeof(int));
    shared_free_size(a->vals, a->capacity, sizeof(int));
    a->vals = new_vals;
    a->capacity *= 2;
}

void earr_insert(struct earr *a, int val) {
    if(a->length >= a->capacity) earr_expand(a);
    a->vals[a->length] = val;
    a->length++;
}

int earr_lookup(struct earr *a, int index) {
    return a->vals[index];
}

working.c

#include "borked.h"

int main(void) {
    struct earr *a = create_earr();
    int i;
    int size = 0x10000;
    for(i = 0; i < size; i++) {
        earr_insert(a, i);
    }
    for(i = 0; i < size; i++) {
        earr_lookup(a, i);
    }
    return EXIT_SUCCESS;
}

broken.c

#include "borked.h"

int main(void) {
    struct earr *a = create_earr();
    int i;
    pid_t pid;
    int size = 0x10000;
    if(0 == (pid = fork())) {
        for(i = 0; i < size; i++) {
            earr_insert(a, i);
        }
    } else {
        int status;
        waitpid(pid, &status, 0);
        for(i = 0; i < size; i++) {
            earr_lookup(a, i);
        }
    }
    return EXIT_SUCCESS;
}

GDB debugging

$ gdb broken
...
(gdb) run
Starting program /path/to/broken

Program received signal SIGSEGV, Segmentation fault.
0x08048663 in earr_lookup (a=0xb7fda000, index=0) at /path/to/borked.h:46
46      return a->vals[index];
(gdb) x/3x 0xb7fda000
0xb7fda000: 0xb7da6000  0x00010000  0x00010000
(gdb) x/x 0xb7da6000
0xb7da6000: Cannot access memory at address 0xb7da6000
  • Linux, or some other OS? Commented Aug 21, 2014 at 0:04
  • My brute-force solution in these cases is to just use Valgrind. Commented Aug 21, 2014 at 0:05
  • Linux (ubuntu to be specific). Commented Aug 21, 2014 at 7:57

2 Answers


Your approach is indeed broken. When one of the child processes eventually calls earr_expand to increase the size of the array and ends up calling mmap, the resulting new mapping exists in that child process only - it does not get propagated to the parent process or other child processes.

You'll need to use a different technique, such as pre-allocating the full amount of memory before fork()ing, or using POSIX shared memory.


1 Comment

I ended up pre-allocating the memory. It turns out it's really fast to just go through a file to check its length and number of lines to find out how much memory I need, so that isn't the bottleneck.

Yes, the address of the shared memory varies from process to process.

The solution in general is to use indexes instead of pointers. In the OP's example code, one could use a structure with a C99 flexible array member to describe the shared memory map, something like

struct shared_mmap {
    unsigned int  size;
    unsigned int  used;
    int           data[];
};

The data element type itself can be a structure, of course.

However, growing an anonymous memory map with mremap() will give you process-local new pages; the new pages won't be shared between the processes. You need to use either a file backing (instead of anonymous shared memory), or better yet, POSIX shared memory.

(If you have sufficient amount of RAM available, the file backing is not necessarily slower, as the file pages will typically be in page cache. File backing has the interesting property of allowing a computation to be stopped and continued with same or even different number of processes, obviously depending on the shared memory map structure. If you use MAP_SHARED | MAP_NORESERVE, you can use a memory map much larger than available RAM + swap -- but if you run out of disk space or quota, your process will receive (and die from) a SIGBUS signal.)

Although on x86 and x86-64 int loads and stores are atomic, you'll most likely end up needing (pthread) mutexes and/or the atomic built-ins (provided by GCC 4.7 and later; older GCC and ICC et al. provide the legacy __sync built-ins).

Finally, if you have an unordered set of ints, I'd consider using a fixed-size (536,870,912 bytes, or 2^32 bits, on Linux) memory map instead, and atomic bitwise operators:

#include <limits.h>

#define UINT_BITS (CHAR_BIT * sizeof (unsigned int))
#define ULONG_BITS (CHAR_BIT * sizeof (unsigned long))

/* Pointer to shared memory of (1<<UINT_BITS)/CHAR_BIT chars */
static unsigned long *shared;

static inline unsigned int shared_get(unsigned int i)
{
    return !!(__atomic_load_n(shared + (i / ULONG_BITS), __ATOMIC_SEQ_CST)
              & (1UL << (i % ULONG_BITS)));
}

static inline void shared_set(unsigned int i)
{
    __atomic_fetch_or(shared + (i / ULONG_BITS), 1UL << (i % ULONG_BITS), __ATOMIC_SEQ_CST);
}

static inline void shared_clear(unsigned int i)
{
    __atomic_fetch_and(shared + (i / ULONG_BITS), ~(1UL << (i % ULONG_BITS)), __ATOMIC_SEQ_CST);
}

In any case, the memory map length must be a multiple of sysconf(_SC_PAGESIZE), which you should evaluate at run time.

(It is a power of two on all known architectures, so reasonably large powers of two, currently 2^21 and larger, are multiples of the page size on all currently supported Linux architectures.)

Questions?

4 Comments

Thank you. Unfortunately, the actual data structure for this problem is a hashset of long long ints (64 bits), with probably about 50 billion loaded at a time. Still, I can see using that solution for a different problem.
@JoshuaDavid: If you need to store dozens of billions essentially random 64-bit integers, an unsorted linear array is a very poor choice. A good choice depends on your access patterns (and access types). Care to explain how the 64-bit integers are accessed? Just tested if already hashed? Not even tested, just stored? Need to know if the neighboring values or ranges are hashed?
The earr is not my real data structure, it is just the most concise way I found to replicate the problem in a somewhat realistic setting. I am currently using a hash table, as in practice it is faster than a bloom filter (which I also tried). The program compares two files and counts the number of substrings of length k which appear in both files, as well as the number which occur only in the second file.
@JoshuaDavid: Sounds interesting. Anyway, as a reminder: use arrays and indexes, never pointers, in shared memory maps. (If you have variable-length items, use offsets relative to the shared memory map start address, using a type that yields sufficient alignment, and pointer arithmetic.)
