The GHC optimizer seems to not be doing as well as it should. Still, you can probably build a much better implementation of sum' using tail recursion and strict values.
Something like (using Bang Patterns):
sum' :: [Int] -> Int
sum' = sumt 0
sumt :: Int -> [Int] -> Int
sumt !n [] = n
sumt !n (x:xs) = sumt (n + x) xs
I havent tested that, but I would bet it gets closer to the c version.
Of course, you are still holding out on the optimizer to get rid of the list. You could just use the same algorithm as you do in c (using int i and a goto):
sumToX x = sumToX' 0 1 x
sumToX' :: Int -> Int -> Int -> Int
sumToX' !n !i x = if (i <= x) then sumToX' (n+i) (i+1) x else n
You still hope that GHC does loop unwinding at the imperative level.
I havent tested any of this, btw.
EDIT: thought I should point out that sum [1..1000000] really should be 500000500000 and is only 1784293664 because of an integer overflow. Why you would ever need to calculate this becomes an open question. Anyways, using ghc -O2 and a naive tail recursive version with no bang patterns (which should be exactly the sum in the standard lib) got me
real 0m0.020s
user 0m0.015s
sys 0m0.003s
Which made me think that the problem was just your GHC. But, it seems my machine is just faster, because the c ran at
real 0m0.005s
user 0m0.001s
sys 0m0.002s
My sumToX (with or without bang patterns) gets half way there
real 0m0.010s
user 0m0.004s
sys 0m0.003s
Edit 2: After disassembling code I think my answer to why the c is still twice as fast (as the list free version) is this: GHC has a lot more overhead before it ever gets to calling main. GHC generates a fair bit of runtime junk. Obviously this gets amortized on real code, but compare to the beauty GCC generates:
0x0000000100000f00 <main+0>: push %rbp
0x0000000100000f01 <main+1>: mov %rsp,%rbp
0x0000000100000f04 <main+4>: mov $0x2,%eax
0x0000000100000f09 <main+9>: mov $0x1,%esi
0x0000000100000f0e <main+14>: xchg %ax,%ax
0x0000000100000f10 <main+16>: add %eax,%esi
0x0000000100000f12 <main+18>: inc %eax
0x0000000100000f14 <main+20>: cmp $0xf4241,%eax
0x0000000100000f19 <main+25>: jne 0x100000f10 <main+16>
0x0000000100000f1b <main+27>: lea 0x14(%rip),%rdi # 0x100000f36
0x0000000100000f22 <main+34>: xor %eax,%eax
0x0000000100000f24 <main+36>: leaveq
0x0000000100000f25 <main+37>: jmpq 0x100000f30 <dyld_stub_printf>
Now, I'm not much of an X86 assembly programmer, but that looks more or less perfect.
Okay, I have graduate school applications to work on. No more.
gccoptimizes the loop away entirely and callsprintfwith a constant. If you use the vector package and the LLVM backend, GHC does the same thing. So you're comparing startup times of C and GHC...ibeing an input), compile-time loop optimizataions (reducing the loop by cramming N additions into each iteration for faster run-time execution), compile-time evaluation / expression elimination (pre-computing the answer and replacing the computation with a constant).