How/when to release memory in wait-free algorithms

Question

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc), how can the right time for releasing a data structure?

The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?

The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.

IMHO Garbage collection can't be the answer, because it would either GC would have to be prevented from running if any single thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again) or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out if the data is in the silly state (a pointer is read [e.g. stored in a local variable] but the the use count didn't increment yet).

PS, references to advanced tutorials on wait-free algorithms for morons are welcome.

Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that will store temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allowing the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.

The folks around the Linux kernel are keenly interested in lockless algorithms, particularly RCU. See Documentation/RCU/ in your kernel sources. — vonbrand
– vonbrand, Commented Mar 8, 2014 at 2:01
@vonbrand thanks, I totally forgot that, it was a LONG time since I had a look at RCU, I'll have to re-read the docs again to see how it factors into this. — Martin
– Martin, Commented Mar 8, 2014 at 2:32
You asked, What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count? The answer is nothing at all. If threads break the rules, then all bets are off. After all, a lock or mutex is a cooperative mutual exclusion device. Lock free algorithms also depend on cooperation. — Jim Mischel
– Jim Mischel, Commented Mar 8, 2014 at 4:11
@JimMischel My question is more along the lines of "how do wait-free algorithms ensure this doesn't happen", i.e. what method is used to induce cooperation. Specifically, if you have a pointer, how do you ensure the data isn't freed after the value is copied in a different thread without a lock. With a mutex it's easy, but how do you do it without one. — Martin
– Martin, Commented Mar 8, 2014 at 4:16
You could use tombstones to reduce the size of the memory leak, but I believe that the state of the art is concurrent GC. I have no idea how it works. — David Eisenstat
– David Eisenstat, Commented Mar 8, 2014 at 5:29

Martin · Accepted Answer · 2014-03-16 23:41:19Z

After some research I learned this.

The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.

Three solutions that I found are 1) hazard pointers, 2) quiescence period based reclamation (used by the Linux kernel in the RCU implementation) 3) reference counting techniques. 4) Other 5) Combinations

Hazard pointers work by saving the currently active references in a well-known per thread location, so any thread deciding to free memory (when the counter appears to be zero) can check if the memory is still in use by anyone. An interesting improvement is to buffer request to release memory in a small array and free them up in a batch when the array is full. The advantage of using hazard pointers is that it can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that it places extra burden on the reader.

Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check if each thread passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved by a simple counter, each time the counter is even the thread is not in a critical section (reading shared data), each time the counter is odd the thread is inside a critical section, to move from a critical section or back all the thread needs to do is to atomically increment the number. Based on this the "garbage collector" can determine if each thread has had a chance to finish. There are several approaches, one simple one would be to queue up the requests to free memory (e.g. in a linked list or an array), each with the current generation (managed by the GC), when the GC runs it checks the state of the threads (their state counters) to see if each passed to the next generation (their counter is higher than the last time or is the same and even), any memory can be reclaimed one generation after it was freed. The advantage of this approach is that is places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound for the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section, while the data keeps changing and memory isn't released), but in practice it works out all right.

There is a number of reference counting solutions, many of them require double compare and swap, which some CPUs don't support, so can't be relied upon. The key problem remains though, taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably though. So .....

There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.

And of course the various approaches can be combined, for example hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom, but doesn't hurt performance in practice. Somewhat like another tidbit I found in my research, it's theoretically not possible to implement wait-free algorithms using compare-and-swap, that's because in theory (purely in theory) a CAS based update might keep failing for non-deterministic excessive times (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality however it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from CAS than there are CPUs, but I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem, then again, who knows, I don't have a hundred core machine to try this). Another results of my research is that designing and implementing wait-free algorithms and data-structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and might perform less well than a similar algorithm with carefully placed locks.

So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.

@KhaledAKhunaifer Maybe, but nobody gave me a real answer so I summarized my findings so far so anyone interested might get a head start. Then again, IMHO it does explain how it can be done. If you care to give me a proper answer, I'd love to see your input. :-)
CAS is not wait-free because at most one thread makes progress at a time and succeeds with the CAS while the others fail and must retry. Wait-free means all threads make progress. Lock xadd is a simple example of wait-free, as all threads executing it succeed.

Mårten Wikström · Accepted Answer · 2014-03-13 09:58:30Z

0

I think atomic operations for increment/decrement and compare-and-swap would solve this problem.

Idea:

All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.

You haven't specified a language for your question. Here goes an example in c#:

class MyResource
{
    // Counter is initially zero. Resource will not be disposed until it has
    // been acquired and released.
    private int _counter;

    public bool Acquire()
    {
        // Atomically increment counter.
        int c = Interlocked.Increment(ref _counter);

        // Resource is available if the resulting value is greater than zero.
        return c > 0;
    }

    public bool Release()
    {
        // Atomically decrement counter.
        int c = Interlocked.Decrement(ref _counter);

        // We should never reach a negative value
        Debug.Assert(c >= 0, "Resource was released without being acquired");

        // Dispose when we reach zero
        if (c == 0)
        {
            // Mark as disposed by setting counter its minimum value.
            // Only do this if the counter remain at zero. Atomic compare-and-swap operation.
            if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
            {
                // TODO: Run dispose code (free stuff)
                return true; // tell caller that resource is disposed
            }
        }

        return false; // released but still in use
    }
}

Usage:

// "r" is an instance of MyResource

bool acquired = false;

try
{
    if (acquired = r.Acquire())
    {
        // TODO: Use resource
    }
}
finally
{
    if (acquired)
    {
        if (r.Release())
        {
            // Resource was disposed.
            // TODO: Nullify variable or similar to let GC collect it.
        }
    }
}

edited Mar 13, 2014 at 9:58

answered Mar 13, 2014 at 9:49

Mårten Wikström

11.4k5 gold badges52 silver badges92 bronze badges

8 Comments

Martin Over a year ago

doesn't explain how to prevent memory release if you dereferenced a pointer and updated the counter.

Mårten Wikström Over a year ago

Doesn't it? I think it does.

Mårten Wikström Over a year ago

@Martin: If you would acquire a resource and dereference it without releasing it first, then you would break the usage pattern. Is this something that you need to support? I don't see any reason for that. Please explain.

Martin Over a year ago

if you dereference after acquiring, you're right, but you can't call acquire without having a pointer, e.g. in a register. that's a low-level dereference without acquiring, after which you acquire, then do the proper high-level dereference. (assume the language is C/C++)

Mårten Wikström Over a year ago

@Martin: But you should not have any pointers to the resource unless you have acquired access to it. Otherwise the pointer itself would be the resource, right?

|

Spektre · Accepted Answer · 2014-03-13 10:08:57Z

0

I know this is not the best way but it works for me:

for shared dynamic data-structure lists I use usage counter per item

for example:

struct _data
 {
 DWORD usage;
 bool  delete;
 // here add your data
 _data() { usage=0; deleted=true; }
 };
const int MAX = 1024; 
_data data[MAX];

now when item is started to be used somwhere then
```
// start use of data[i]
data[i].cnt++;
```
after is no longer used then
```
// stop use of data[i]
data[i].cnt--;
```

if you want to add new item to list then

// add item
for (i=0;i<MAX;i++) // find first deleted item
 if (data[i].deleted)
  {
  data[i].deleted=false; 
  data[i].cnt=0;
  // copy/set your data
  break;
  }

and now in the background once in a while (on timer or whatever)
scann data[] an all undeleted items with cnt == 0 set as deleted (+ free its dynamic memory if it has any)

[Note]

to avoid multi-thread access problems implement single global lock per data list
and program it so you cannot scann data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
```
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
```

now any change of data[i].cnt modify like this:

// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;

and modify delete scan like this

while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
for (i=0;i<MAX;i++) // here scan for items to delete ...
 if (!data[i].cnt)
  if (!data[i].deleted)
  {
  data[i].deleted=true; 
  data[i].cnt=0;
  // release your dynamic data ...
  }
data_cnt_locked=false;

PS.

do not forget to play with the sleep times a little to suite your needs
lock free algorithm sleep times are sometimes dependent on OS task/scheduler
this is not really an lock free implementation
because while GC is at work then all is locked
but if ather than that multi access is not blocking to each other
so if you do not run GC too often you are fine

edited Mar 13, 2014 at 10:08

answered Mar 13, 2014 at 9:46

Spektre

52.2k12 gold badges125 silver badges413 bronze badges

19 Comments

Spektre Over a year ago

btw after moving to W7 platform for more robust applications I use locks with access order stack (do not change the access order) so slower unlucky threads will not freeze to death because on W7 many of my old lock free multi-thread programs stops working safely (scheduller behaviour has changed somehow) especially with USB device drivers usage

Martin Over a year ago

doesn't solve the problem of taking a reference before incrementing the counter in a wait-free manner. A big lock is a big problem. Small locks in the data structure won't work correctly (reference before increment).

Spektre Over a year ago

Yes of course you can not use reference before cnt update. if you do than it is all broke ... If you want to use references without any kind of management than you can not solve this by any means !!!

Martin Over a year ago

But the problem is that in non-managed languages you have to do the management yourself, and my question is how to do the management. As I said, using the GC is fine, if you explain how the GC knows that you have a pointer in a register.

Spektre Over a year ago

the delete scan (last code block) is the GC !!! and the rest is the management arround. I do not use any kind of build in GC ... only mine own

|

Collectives™ on Stack Overflow

How/when to release memory in wait-free algorithms

3 Answers 3

2 Comments

8 Comments

19 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

8 Comments

19 Comments

Your Answer

Sign up or log in

Post as a guest

Related