There's no reason to doubt the Arduino hardware. We need to put some figures on what you are experiencing, such as the baud rate and buffer sizes.
With this sort of hardware flow control I would expect you to send a byte or so more than intended, because it takes a finite time to react to the flow-control signal. In particular, the hardware has its own buffer, so it can be starting to send a byte at the very moment you get the signal to stop.
I would allow some leeway (say, 10 bytes) within which you expect the sender to have stopped sending.
A problem can arise if the Arduino API is providing more latency than the path under test
I'm not sure what you mean by this. However, the HardwareSerial class has a 64-byte buffer (if you are using a Uno), which means you may already have put up to 64 bytes into it before noticing that the other end wants you to stop sending.
This is probably what you are experiencing. See HardwareSerial.cpp:
// Define constants and variables for buffering incoming serial data. We're
// using a ring buffer (I think), in which head is the index of the location
// to which to write the next incoming character and tail is the index of the
// location from which to read.
#if (RAMEND < 1000)
#define SERIAL_BUFFER_SIZE 16
#else
#define SERIAL_BUFFER_SIZE 64
#endif
For this test you might want to edit that file and reduce that number substantially.
Alternatively, use SoftwareSerial. I don't think that buffers writes (as it doesn't have hardware to do them in the background) so that would reduce the latency to a single byte.
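To put rough figures on the buffering latency mentioned above, here is a small sketch of the arithmetic. It assumes 8N1 framing (10 bits on the wire per byte), which you should adjust if your setup differs. The function name drainTimeMicros is my own and not part of any Arduino library.

```cpp
#include <stdint.h>

// Time, in microseconds, for already-buffered bytes to leave the wire.
// Assumption: bitsPerFrame = 10 (1 start + 8 data + 1 stop, i.e. 8N1).
// The result is truncated towards zero.
uint32_t drainTimeMicros(uint32_t bufferedBytes, uint32_t baud,
                         uint32_t bitsPerFrame = 10) {
  // Use 64-bit arithmetic so the intermediate product cannot overflow.
  uint64_t bits = (uint64_t)bufferedBytes * bitsPerFrame;
  return (uint32_t)(bits * 1000000ULL / baud);
}
```

For example, drainTimeMicros(64, 9600) comes to roughly 66 ms, and drainTimeMicros(64, 115200) to about 5.5 ms. That is how long a full 64-byte buffer keeps transmitting after you have decided to stop writing to it.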
I believe this is most likely the case, but wanted to see if there has been any concrete numbers on the overhead of Arduino, particularly in regard to its Serial API.
It's not an overhead, it's a design issue. If you do what I suggested it should work fine.
What I mean by Arduino API overhead is the fact that the actual hardware interface is substantially abstracted away from the register level implementation ...
Well, not really. When you write a byte, it is put into a circular buffer, as you would expect. When the hardware is free to output another byte (generally signalled by an interrupt), it outputs it. What else could it do?
However you are absolutely free to address the hardware registers yourself and not use HardwareSerial.
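To illustrate the circular-buffer idea described above, here is a simplified sketch. It is not the actual HardwareSerial code: the name TxRing and its methods are made up for this example, and the real class differs in details (for instance, this version just returns false when full). The point is only that anything pushed into the buffer is already committed to being sent later.

```cpp
#include <stddef.h>
#include <stdint.h>

// Simplified transmit ring buffer (illustrative only).
// One slot is always left empty so that head == tail means "empty".
template <size_t N>
struct TxRing {
  uint8_t data[N];
  volatile size_t head = 0;  // next slot to write (filled by the writer)
  volatile size_t tail = 0;  // next slot to read (drained by the transmit interrupt)

  // Number of bytes queued but not yet sent.
  size_t pending() const { return (head + N - tail) % N; }

  // Queue one byte; returns false if the buffer is full.
  bool push(uint8_t b) {
    size_t next = (head + 1) % N;
    if (next == tail) return false;
    data[head] = b;
    head = next;
    return true;
  }

  // Take the oldest byte; returns false if nothing is queued.
  bool pop(uint8_t &b) {
    if (head == tail) return false;
    b = data[tail];
    tail = (tail + 1) % N;
    return true;
  }
};
```

Whatever pending() reports at the moment the flow-control signal changes is the number of bytes that will still go out unless you take steps to avoid queuing them in the first place.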
To keep things simple though, what you really need to do here is not buffer because that makes reacting to the flow-control flag somewhat slow. And here is how to do it:
Serial.write ('a'); // or whatever
Serial.flush (); // wait until everything is sent
Now you have effectively thrown away the buffering, and you send at the rate the hardware can manage (probably slightly less, since each write waits for the previous byte to finish).
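Extending the write-then-flush idea to a whole message might look something like the sketch below. The helper sendUnbuffered and the clearToSend callback are hypothetical names of mine. I am assuming you have some way of reading the flow-control line (for example a digitalRead on whichever pin you wired it to), which you would wrap in that callback.

```cpp
#include <stddef.h>
#include <stdint.h>

// Send bytes one at a time, checking the flow-control signal before each one.
// Port is anything with write(uint8_t) and flush(), such as Arduino's Serial.
// clearToSend is a caller-supplied function returning true while the
// receiver is ready to accept data (hypothetical, depends on your wiring).
// Returns how many bytes were sent, so the caller can resume from there.
template <typename Port, typename ReadyFn>
size_t sendUnbuffered(Port &port, const uint8_t *data, size_t len,
                      ReadyFn clearToSend) {
  size_t sent = 0;
  while (sent < len && clearToSend()) {
    port.write(data[sent]);  // queue a single byte
    port.flush();            // wait until it has actually been sent
    ++sent;
  }
  return sent;
}
```

In a sketch you would call it along the lines of sendUnbuffered(Serial, buf, n, readyFn), and call it again later with the remaining bytes if it returns less than n. Because the signal is checked before every byte, the overshoot should be limited to about a byte rather than a whole buffer.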