... aren't they supposed to run at the same time?
Well, that depends.
If you have two or more cores, they may genuinely run in parallel.
Even if you have the hardware available, it's up to your OS to decide how to schedule your threads. If you want to encourage it to interleave the two threads (you can't force it without more work), try adding sleep, nanosleep or yield calls to your loops; the exact primitives depend on your platform.
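For example, here's a minimal sketch assuming your program is roughly two POSIX threads each printing in a loop (I'm guessing at the shape of your code): a `sched_yield()` in each iteration gives the scheduler an explicit opportunity to switch threads.

```c
/* Sketch only: assumes two POSIX threads printing in a loop.
 * Compile with: cc interleave.c -pthread */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    const char *name = arg;
    for (int i = 0; i < 10; i++) {
        printf("%s: %d\n", name, i);
        sched_yield();   /* hint: let another runnable thread go now */
    }
    return NULL;
}

int main(void)
{
    pthread_t child;
    pthread_create(&child, NULL, worker, "child");
    worker("parent");            /* the parent runs the same loop itself */
    pthread_join(child, NULL);
    return 0;
}
```

A `nanosleep` with a tiny interval in place of the `sched_yield` would have a similar effect, at the cost of actually blocking for a while.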
If it helps you build an intuition about how and why a kernel makes scheduling decisions, note that most CPUs keep a significant amount of per-core state (branch prediction tables, data and instruction caches) that is very effective at speeding up a single thread of execution.
Therefore, it's generally more efficient to let a given thread run on a given core for as long as possible, to minimize avoidable context switches, cache misses and branch mispredictions.
Now, timeslicing is often used as a tradeoff between the best throughput for each individual thread and the best latency or responsiveness to external events. A thread may block (waiting for an external event such as user input or device I/O, explicitly synchronizing with another thread, or explicitly sleeping or yielding), in which case another thread is scheduled while the first can't make progress; otherwise, it will typically run until the kernel pre-empts it at the end of its allotted time slice.
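You can watch pre-emption do the interleaving on its own by making each thread do enough work to outlive a time slice. This is only a sketch (the loop counts are arbitrary, and on a multi-core machine you'd want to pin the process to a single core, e.g. with `taskset -c 0`, so parallelism doesn't hide the effect):

```c
/* Sketch: if each thread burns enough CPU to outlive its time slice,
 * pre-emption alone interleaves the output -- no sleeps or yields.
 * Loop counts are arbitrary; tune them for your machine. */
#include <pthread.h>
#include <stdio.h>

static void *spin(void *arg)
{
    const char *name = arg;
    volatile unsigned long sink = 0;
    for (int chunk = 0; chunk < 20; chunk++) {
        for (unsigned long i = 0; i < 50000000UL; i++)
            sink += i;                     /* burn CPU time */
        printf("%s finished chunk %d\n", name, chunk);
    }
    return NULL;
}

int main(void)
{
    pthread_t child;
    pthread_create(&child, NULL, spin, "child");
    spin("parent");
    pthread_join(child, NULL);
    return 0;
}
```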
When the parent thread creates a child thread, I wouldn't like to guess which is "hotter" on the current core, so letting the parent finish its timeslice (unless it blocks) is a reasonable default.
The child thread is probably runnable right away, but if it doesn't pre-empt the parent thread, it isn't obvious why it should immediately pre-empt a thread on a different core either. After all, it's still in the same process as the parent and shares the same memory, address space and other resources: unless another core is completely idle, the best place to schedule the child is probably on the same core as its parent, since there's a decent chance the parent has kept those shared resources warm in that core's caches.
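If you want to check that hunch on Linux, `sched_getcpu()` (a glibc extension, so this isn't portable) reports which core the calling thread is currently on; a quick sketch:

```c
/* Linux/glibc-only sketch: print the core each thread is running on,
 * to see where the child lands relative to its parent. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *child_fn(void *arg)
{
    (void)arg;
    printf("child started on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    printf("parent running on CPU %d\n", sched_getcpu());
    pthread_t child;
    pthread_create(&child, NULL, child_fn, NULL);
    pthread_join(child, NULL);
    return 0;
}
```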
So, the reason your threads don't get interleaved is likely that neither runs for an appreciable fraction of a timeslice before the process exits, and neither does any blocking I/O or explicitly yields (writing that little data to stdout doesn't block, since it fits comfortably in the stream's buffer).