The producer-consumer pattern doesn't scale well. The more producers or consumers you have, the worse the performance gets, because the common queue becomes a bottleneck for the whole system: every producer and every consumer contends on that one queue.
A better approach is to have no common queue: give each consumer its own queue. When a request comes in, it goes to a load balancer, which places the request in the consumer queue that is currently smallest. The balancer becomes the bottleneck, but it doesn't do much work - it just picks the right queue for the incoming request - so it should be darn fast.
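Here is a minimal sketch of that design in Java (the names `LoadBalancer`, `dispatch`, and `queueFor` are my own, not from any library):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: one queue per consumer; the balancer dispatches each
// incoming request to whichever queue is currently smallest.
public class LoadBalancer<T> {
    private final List<Queue<T>> queues = new ArrayList<>();

    public LoadBalancer(int consumers) {
        for (int i = 0; i < consumers; i++) {
            queues.add(new ConcurrentLinkedQueue<>());
        }
    }

    // Place the request on the consumer queue with the fewest items.
    // Note: size() on ConcurrentLinkedQueue is O(n); in practice you
    // would keep a counter per queue, but this keeps the sketch short.
    public void dispatch(T request) {
        Queue<T> smallest = queues.get(0);
        for (Queue<T> q : queues) {
            if (q.size() < smallest.size()) smallest = q;
        }
        smallest.add(request);
    }

    // Each consumer thread polls only its own queue.
    public Queue<T> queueFor(int consumer) {
        return queues.get(consumer);
    }

    public static void main(String[] args) {
        LoadBalancer<String> lb = new LoadBalancer<>(3);
        for (int i = 0; i < 9; i++) lb.dispatch("req-" + i);
        for (int i = 0; i < 3; i++) {
            System.out.println("consumer " + i + ": " + lb.queueFor(i).size() + " queued");
        }
    }
}
```

Consumers only ever touch their own queue, so the only shared point left is the balancer's dispatch loop.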
Here is an edit to answer your questions:
Problem (in more depth): the more cores you have, the slower it gets. Why? Shared memory - every core contends on the same queue, so the cache lines holding the queue's state keep bouncing between cores.
@Peyman use ConcurrentLinkedQueue (a non-blocking, lock-free queue where enqueues and dequeues can proceed concurrently). Try it in your initial design too and benchmark both designs. I expect your revised design to perform better, because each queue has only one enqueuer (the balancer) and one dequeuer (its consumer) at the same time, as opposed to one enqueuer and n dequeuers on the shared queue in your initial design (but this is just my speculation).
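To be concrete about what "non-blocking" means here, a tiny sketch of `ConcurrentLinkedQueue` behavior (request names are illustrative):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Both offer() and poll() are non-blocking: poll() returns null on an
// empty queue instead of waiting, so a consumer loop must handle the
// empty case itself (spin, back off, or park).
public class ClqDemo {
    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>();
        queue.offer("req-1");              // enqueue, never blocks
        queue.offer("req-2");
        System.out.println(queue.poll());  // req-1 (FIFO order)
        System.out.println(queue.poll());  // req-2
        System.out.println(queue.poll());  // null - queue is empty
    }
}
```

This is the trade-off versus a `BlockingQueue`: you lose the built-in "wait until a task arrives" behavior, but you also lose the lock contention that comes with it.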
A great paper on scalable producer-consumer designs using balancers
Read this page (or look only at the section "migrate from the common worker queue approach to the queue-per-thread approach")
Here's a list from http://www.javaperformancetuning.com/news/newtips128.shtml. I think the last 3 points are more applicable to you:
- Most server applications use a common worker queue and thread pool; a shared worker queue holds short tasks that arrive from remote sources; a pool of threads retrieves tasks from the queue and processes the tasks; threads are blocked on the queue if there is no task to process.
- A feeder queue shared amongst threads is an access bottleneck (from contention) when the number of tasks is high and the task time is very short. The bottleneck gets worse the more cores that are used.
- Solutions available for overcoming contention in accessing a shared queue include: Using lock-free data structures; Using concurrent data structures with multiple locks; Maintaining multiple queues to isolate the contention.
- A queue-per-thread approach eliminates queue access contention, but is not optimal when a queue is emptied while there is unprocessed queued data in other queues. To improve this, idle threads should have the ability to steal work from other queues. To keep contention to a minimum, the 'steal' should be done from the tail of the other queue (where normal dequeuing from the thread's own queue is done from the head of the queue).
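The last point can be sketched with a deque per worker (the `WorkQueue` name is illustrative): the owner dequeues from the head, while an idle thief steals from the tail, so the two ends rarely contend for the same element:

```java
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

// Sketch of queue-per-thread with work stealing. New work goes on the
// tail; the owning thread dequeues from the head (FIFO for its own
// work); an idle thread steals from the tail of another worker's deque.
public class WorkQueue<T> {
    private final Deque<T> tasks = new ConcurrentLinkedDeque<>();

    public void submit(T task) { tasks.addLast(task); }   // balancer enqueues at the tail

    public T takeOwn()         { return tasks.pollFirst(); } // owner dequeues from the head

    public T steal()           { return tasks.pollLast(); }  // thief steals from the tail

    public static void main(String[] args) {
        WorkQueue<String> q = new WorkQueue<>();
        q.submit("t1");
        q.submit("t2");
        q.submit("t3");
        System.out.println("owner took:  " + q.takeOwn()); // t1, from the head
        System.out.println("thief stole: " + q.steal());   // t3, from the tail
        System.out.println("owner took:  " + q.takeOwn()); // t2
    }
}
```

This is the same idea the JDK's `ForkJoinPool` is built around; the sketch above just shows the two-ended access pattern in isolation.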