I am writing a CUDA kernel which requires me to allocate an array of aligned structs on the device.
The computations themselves produce correct results, and I need to write those values into this array contiguously, starting from index 0.
When I write to this array and copy the results back to the host, some of the entries come back as zero.
Clearly, I am not advancing the index the way I intend. I tried using a counter that I increment with atomicAdd(), but some values are still zero.
To be precise, I may use 1000 threads in my kernel for the computations, but the allocated output array may hold fewer than 100 or more than 10000 elements.
My question is: how do I make each thread write its value to exactly one location in the array (as the values are calculated) and increment the array index/counter by 1, without any thread overwriting another thread's entry?
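To make the intended pattern concrete, here is a minimal sketch of it. It is a hypothetical illustration and not the actual kernel: the struct `Result`, the kernel `collect`, the helper `run_collect`, and the "keep positive values" filter are all made-up names and assumptions. The idea it shows is that each thread uses the value returned by atomicAdd() as its own slot, that the counter is zeroed before the launch, and that only as many elements as the counter reports are copied back.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical aligned output record (the real struct is not shown above).
struct __align__(16) Result {
    float value;   // computed value
    int   source;  // global index of the thread that produced it
};

// Every thread that has a result reserves one unique slot.
// atomicAdd returns the counter value from BEFORE the increment, so that
// return value is the slot index; re-reading *counter afterwards would race.
__global__ void collect(const float* in, int n, Result* out,
                        unsigned int* counter, unsigned int capacity)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float v = in[tid];
    if (v <= 0.0f) return;                  // assumed filter: keep positives only

    unsigned int slot = atomicAdd(counter, 1u);
    if (slot < capacity) {                  // drop results that do not fit
        out[slot].value  = v;
        out[slot].source = tid;
    }
}

// Hypothetical host helper; CUDA error checking is omitted for brevity.
std::vector<Result> run_collect(const std::vector<float>& h_in,
                                unsigned int capacity)
{
    int n = static_cast<int>(h_in.size());
    if (n == 0 || capacity == 0) return {};

    float* d_in = nullptr;
    Result* d_out = nullptr;
    unsigned int* d_counter = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, capacity * sizeof(Result));
    cudaMalloc(&d_counter, sizeof(unsigned int));

    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_counter, 0, sizeof(unsigned int));   // counter must start at 0

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    collect<<<blocks, threads>>>(d_in, n, d_out, d_counter, capacity);

    // This blocking copy waits for the kernel to finish before reading.
    unsigned int count = 0;
    cudaMemcpy(&count, d_counter, sizeof(count), cudaMemcpyDeviceToHost);
    if (count > capacity) count = capacity;           // counter may overshoot

    // Copy back only the slots that were actually written.
    std::vector<Result> h_out(count);
    if (count > 0)
        cudaMemcpy(h_out.data(), d_out, count * sizeof(Result),
                   cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_counter);
    return h_out;
}

int main()
{
    std::vector<float> h_in(1000);
    for (int i = 0; i < 1000; ++i)
        h_in[i] = (i % 3 == 0) ? float(i + 1) : 0.0f;  // every third input yields a result

    std::vector<Result> results = run_collect(h_in, 10000);
    printf("collected %zu results\n", results.size());
    return 0;
}
```

With this pattern the order of the entries in the output array is not deterministic, since it depends on which thread reaches atomicAdd() first. If zeros still show up, things I would check in a setup like this are an uninitialized counter, indexing with the counter itself instead of the value atomicAdd() returned, and copying back more elements than the counter reports.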
Any help will be appreciated. Thanks in advance.