I just tried compiling your code with avr-gcc 4.9.2:
buffer[0] = PORTD;
buffer[1] = PORTD;
...
buffer[29] = PORTD;
Here is what I got:
in r24, 0x0b ; temp = PORTD – 1 cycle
sts 0x0110, r24 ; buffer[0] = temp – 2 cycles
in r24, 0x0b ; temp = PORTD – 1 cycle
sts 0x0111, r24 ; buffer[1] = temp – 2 cycles
...
That's 3 cycles per read, i.e. a 5.33 MHz reading frequency. For
some reason, the compiler didn't want to use the st X+, r24
instruction suggested in Tom Carpenter's answer. Let's try to give the
compiler a little hint, and rewrite the C code as follows:
uint8_t * p = buffer;
*p++ = PORTD;
*p++ = PORTD;
...
This generated the exact same assembly! The compiler figured out
the address of each memory write and replaced each occurrence of the
pointer p with an explicit address. To defeat this kind of
“optimization”, let's make the pointer a variable whose value is unknown
at compile time:
void fill_buffer(uint8_t *p)
{
*p++ = PORTD;
*p++ = PORTD;
...
*p++ = PORTD;
}
Here is the generated assembly:
movw r30, r24 ; Z = p (Z is the register pair r31:r30)
in r24, 0x0b ; temp = PORTD – 1 cycle
st Z, r24 ; *Z = temp – 2 cycles
in r24, 0x0b ; temp = PORTD – 1 cycle
std Z+1, r24 ; *(Z+1) = temp – 2 cycles
...
in r24, 0x0b ; temp = PORTD – 1 cycle
std Z+29, r24 ; *(Z+29) = temp – 2 cycles
ret ; return
Still 3 cycles per read. Here the compiler is using the std
(store with displacement) instruction rather than st X+ (store with
post-increment).
In the end, which instruction the compiler chooses doesn't really
matter: all the memory access instructions take two cycles. So
repeatedly transferring data from a port to RAM will cost
3 cycles per transfer, irrespective of the instruction used
for the memory write.
Now, this doesn't mean you can't read faster. The AVR CPU core has
32 general-purpose registers. Since you are only performing
30 port reads per burst, you can use the register file
as an ultra-fast temporary buffer. This is easier to do in assembly
than in C, and it costs a significant overhead in saving the registers
to the stack and restoring them afterwards, but the read burst itself
will be faster:
; declare as:
; extern "C" void fill_buffer(uint8_t *p);
.global fill_buffer
fill_buffer:
; Prologue: save registers and move the pointer.
push r2 ; save all the registers belonging to the caller:
push r3 ; - 18 registers to save (r2 – r17, r28, r29)
...
push r28 ; - 2 cycles per register
movw r30, r24 ; Z = p (Z = r31:r30 is a pointer register)
; Now we can read the port really fast.
in r0, 0x0b ; temp_0 = PORTD – 1 cycle
in r1, 0x0b ; temp_1 = PORTD – 1 cycle
...
in r29, 0x0b ; temp_29 = PORTD – 1 cycle
; Now save to RAM.
st Z+, r0 ; *Z++ = temp_0 – 2 cycles
st Z+, r1 ; *Z++ = temp_1 – 2 cycles
...
st Z+, r29 ; *Z++ = temp_29 – 2 cycles
; Epilogue: restore the registers.
pop r28 ; restore all the previously saved registers:
...
pop r3 ; - 18 registers to restore
pop r2 ; - 2 cycles per register
clr r1 ; leave r1 cleared, as required by the ABI
ret ; return
Now we are reading the port at 16 MHz: one read per cycle!
It turns out that we can convince the compiler to do exactly this.
I had to see it to believe it, but it works. Something essentially
equivalent to the above assembly can be generated from C++ like this:
// Quickly read the port into temporaries.
uint8_t temp_0 = PORTD;
uint8_t temp_1 = PORTD;
...
uint8_t temp_29 = PORTD;
// Now save to RAM.
buffer[0] = temp_0;
buffer[1] = temp_1;
...
buffer[29] = temp_29;