cache inefficiency, where branch prediction won't work well
These are different things. Linked lists suffer from cache inefficiency:
- Nodes are usually not necessarily allocated contiguously and in order, which is bad for spatial locality. You can sometimes avoid this, for example with custom allocators. With generational garbage collection, allocating nodes closely together in time also tends to put them close together in space, but that's probably not a very common thing to actually happen when using a linked list.
- Having a pointer (and potentially other junk, like an object header and padding) in the node wastes space. Wasting a bunch of space is not inherently super bad, but it is bad when the wasted space is touched, which loads it into the cache. That actually happens here: that pointer to the next node is definitely needed, and the other junk is likely in the same cache line so it gets pulled in as well. This wastes both cache space and bandwidth (both to higher level caches and maybe to memory), which is pretty bad.
Linked lists don't really suffer from branch misprediction inherently. Yes, if you iterate over one, the last branch (the one that exits the loop) has a decent chance of being mispredicted, but that is not specific to linked lists.
How is the pointer access in the linked list slower than the offset access in the array?
Loading a pointer at all is slower than calculating the next address of an element in an array, both in terms of latency and in terms of throughput. For a quick comparison, typical on a modern machine is that loading that point takes around 4 cycles (at best! if there is a cache miss, it takes much longer) and could be done twice per cycle. Adding the size of an array element to the current address takes 1 cycle and can be done 4 times per cycle, and you (or the compiler) may be able to re-use the increment of the loop counter for this with some clever coding. For example, maybe you can use indexed addressing with the loop counter (which is incremented anyway) as index, or you can "steal" the loop counter entirely and increment it by the size of an element (scaling the loop-end correspondingly), or have no loop counter and directly compare the current address to the address just beyond the end of the array. Compilers like to use tricks like these automatically.
It's actually much worse than that makes it sound, because loading those pointers in a linked list is completely serial. Yes, the CPU can load two things per cycle, but it takes 4 cycles until it knows where the next node is so that it can start loading the next pointer, so realistically it can find the address of a node only once every 4th cycle. Computing the addresses of array elements has no such problem, maybe there will be a latency of 1 between the computation of successive addresses but (because actual loops cannot be faster than that anyway) that only hurts when the loop is unrolled, and if necessary the address of the element k steps ahead can be computed just by adding k*sizeof(element) (so several addresses can be computed independently, and compilers do this too when they unroll loops).
Doing a sufficient amount of work per "step" through a linked list can hide the latency problem.