Poor Haskell performance with lazy lists

Question

I tried to test Haskell performance, but got some unexpectedly poor results:

main = do
  putStrLn $ show $ sum' [1..1000000]

sum' :: [Int] -> Int
sum' [] = 0
sum' (x:xs) = x + sum' xs

I first ran it from ghci -O2:

> :set +s
> sum' [1..1000000]
1784293664
(4.81 secs, 163156700 bytes)

Then I compiled the code with ghc -O3, ran it using time and got this:

1784293664

real    0m0.728s
user    0m0.700s
sys     0m0.016s

Needless to say, these results are abysmal compared to the C code:

#include <stdio.h>

int main(void)
{
    int i, n;
    n = 0;
    for (i = 1; i <= 1000000; ++i)
        n += i;
    printf("%d\n", n);
}

After compiling it with gcc -O3 and running it with time I got:

1784293664

real    0m0.022s
user    0m0.000s
sys     0m0.000s

What is the reason for such poor performance? I assumed that Haskell would never actually construct the list. Am I wrong in that assumption, or is this something else?

UPD: Is the problem that Haskell doesn't know that addition is associative? Is there a way to make it see and use that?

It's not enough for an answer, but I've found that this Real World Haskell chapter can be really useful for visualizing this stuff.
– Jeff Burka, Commented Dec 14, 2011 at 5:59
This is such a silly benchmark. At least here, gcc optimizes the loop away entirely and calls printf with a constant. If you use the vector package and the LLVM backend, GHC does the same thing. So you're comparing startup times of C and GHC...
– Daniel Wagner, Commented Dec 14, 2011 at 7:04
As I mention at the bottom of my answer: this benchmark was at the crux of a legendary mailing list flame war almost 3 years ago. Tread carefully. If you can, clarify what you care about: computations that must happen at runtime (e.g. because the bound on i is an input), compile-time loop optimizations (unrolling the loop by cramming N additions into each iteration for faster run-time execution), or compile-time evaluation / expression elimination (pre-computing the answer and replacing the computation with a constant).
– Thomas M. DuBuisson, Commented Dec 14, 2011 at 7:19
@ThomasM.DuBuisson, has there been any progress on loop unrolling in GHC since that mailing list thread? At my university I have several times seen Haskell beginners benchmark sum as shown above. When they realize it falls short of gcc, I guess it would be nice to showcase Dons's results if they are part of -O2.
– HaskellElephant, Commented Dec 14, 2011 at 10:14

bitmask · Accepted Answer · 2011-12-14 07:35:59Z

First, don't bother to discuss GHCi when you're talking about performance. It's nonsense to use -Ox flags with GHCi.

You're Building Up A Huge Computation

Using GHC 7.2.2 x86-64 with -O2 I get:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

The reason this uses so much stack space is that on every recursive step you build an expression of the form i + ..., so your computation is transformed into one huge thunk:

n = 1 + (2 + (3 + (4 + ...

That's going to take a lot of memory. There is a reason the standard sum isn't defined like your sum'.
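To make the shape of that expression concrete, here is a minimal sketch that repeats the question's definition and writes out its evaluation on a three-element list by hand. The trace lives in the comments; only the tiny call in main actually runs.

```haskell
-- The question's definition, unchanged.
sum' :: [Int] -> Int
sum' [] = 0
sum' (x:xs) = x + sum' xs

-- Evaluation on a short list, written out by hand:
--   sum' [1,2,3]
--   = 1 + sum' [2,3]
--   = 1 + (2 + sum' [3])
--   = 1 + (2 + (3 + sum' []))
--   = 1 + (2 + (3 + 0))
-- No addition can run until the end of the list is reached, so the
-- number of pending additions grows with the length of the list.

main :: IO ()
main = print (sum' [1, 2, 3])  -- small input, so the stack stays shallow
```

With a million elements the same pattern leaves a million additions pending at once, which is where the stack goes.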

With A Reasonable Definition for sum

If I change your sum' to sum or an equivalent such as foldl' (+) 0 then I get:

$  ghc -O2 -fllvm so.hs
$ time ./so
500000500000

real    0m0.049s

That seems entirely reasonable to me. Keep in mind that with such a short-running piece of code, much of your measured time is noise (loading the binary, starting up the RTS and GC nursery, miscellaneous initialization, etc.). Use Criterion (a benchmarking tool) if you want accurate measurements of small-ish Haskell computations.
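For reference, a version along those lines might look like the sketch below. The name sumStrict is made up for illustration, and this assumes a 64-bit Int (as on x86-64 GHC), so the total does not overflow.

```haskell
import Data.List (foldl')

-- Strict left fold: the running total is forced at every step,
-- so no chain of pending additions builds up.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0

main :: IO ()
main = print (sumStrict [1 .. 1000000])
```

The only change from the question's code is the direction and strictness of the fold; the list expression itself stays the same.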

Comparing to C

My gcc -O3 time is immeasurably low (reported as 0.002 seconds) because the main routine consists of 4 instructions - the entire computation is evaluated at compile time and the constant of 0x746a5a2920 is stored in the binary.
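As a side check on that constant, here is a small sketch (the names full and wrapped are made up for illustration). It computes the sum at 64 bits, shows it in hex, and then truncates it to 32 bits, which gives the 1784293664 that the question's programs print.

```haskell
import Data.Int (Int32, Int64)
import Numeric (showHex)

-- The sum computed at 64 bits, where it fits without overflow.
full :: Int64
full = sum [1 .. 1000000]

-- The same value truncated to 32 bits; fromIntegral keeps the low 32 bits.
wrapped :: Int32
wrapped = fromIntegral full

main :: IO ()
main = do
  print full                          -- 500000500000
  putStrLn ("0x" ++ showHex full "")  -- 0x746a5a2920
  print wrapped                       -- 1784293664
```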

There is a rather long Haskell thread (here, but beware: it's something of an epic flame war that still burns in people's minds almost 3 years later) where people discuss the realities of doing this in GHC, starting from your exact benchmark. It isn't there yet, but they did come up with some Template Haskell work that would do this if you wish to achieve the same results selectively.

Philip JF · Accepted Answer · 2011-12-14 07:17:19Z

The GHC optimizer seems to not be doing as well as it should. Still, you can probably build a much better implementation of sum' using tail recursion and strict values.

Something like (using Bang Patterns):

{-# LANGUAGE BangPatterns #-}

sum' :: [Int] -> Int
sum' = sumt 0

-- n is the running total; the bang forces it on every call
sumt :: Int -> [Int] -> Int
sumt !n [] = n
sumt !n (x:xs) = sumt (n + x) xs

I haven't tested that, but I would bet it gets closer to the C version.

Of course, you are still relying on the optimizer to get rid of the list. You could just use the same algorithm as you do in C (using int i and a goto):

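-- needs {-# LANGUAGE BangPatterns #-} at the top of the file
sumToX :: Int -> Int
sumToX x = sumToX' 0 1 x
-- n is the running total, i the loop counter, x the upper bound
sumToX' :: Int -> Int -> Int -> Int
sumToX' !n !i x = if i <= x then sumToX' (n+i) (i+1) x else n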

You still hope that GHC does loop unwinding at the imperative level.

I haven't tested any of this, btw.
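A quick way to sanity-check both versions is to compare them against the closed form n(n+1)/2. The sketch below restates the two functions in one self-contained file; closedForm and the local helper go are names introduced here for illustration, and the input is kept small so the result fits in any Int.

```haskell
{-# LANGUAGE BangPatterns #-}

-- List-based version with a strict accumulator.
sumt :: Int -> [Int] -> Int
sumt !n [] = n
sumt !n (x:xs) = sumt (n + x) xs

-- List-free version: counts i from 1 up to x.
sumToX :: Int -> Int
sumToX = go 0 1
  where
    go !n !i x
      | i <= x = go (n + i) (i + 1) x
      | otherwise = n

-- Closed form used as the reference value.
closedForm :: Int -> Int
closedForm n = n * (n + 1) `div` 2

main :: IO ()
main = do
  let n = 1000
  print (sumt 0 [1 .. n] == closedForm n)
  print (sumToX n == closedForm n)
```

This only checks correctness; it says nothing about speed, which still needs a proper benchmark.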

EDIT: I should point out that sum [1..1000000] really should be 500000500000; it comes out as 1784293664 only because of integer overflow. Why you would ever need to calculate this is an open question. Anyway, using ghc -O2 and a naive tail-recursive version with no bang patterns (which should be exactly the sum in the standard library) got me

real    0m0.020s
user    0m0.015s
sys     0m0.003s

That made me think the problem was just your GHC. But it seems my machine is simply faster, because the C version ran at

real    0m0.005s
user    0m0.001s
sys     0m0.002s

My sumToX (with or without bang patterns) gets halfway there

real    0m0.010s
user    0m0.004s
sys     0m0.003s

Edit 2: After disassembling the code, I think the reason the C is still twice as fast (as the list-free version) is this: GHC has a lot more overhead before it ever gets to calling main. GHC generates a fair bit of runtime junk. Obviously this gets amortized in real code, but compare it to the beauty GCC generates:

0x0000000100000f00 <main+0>:    push   %rbp
0x0000000100000f01 <main+1>:    mov    %rsp,%rbp
0x0000000100000f04 <main+4>:    mov    $0x2,%eax
0x0000000100000f09 <main+9>:    mov    $0x1,%esi
0x0000000100000f0e <main+14>:   xchg   %ax,%ax
0x0000000100000f10 <main+16>:   add    %eax,%esi
0x0000000100000f12 <main+18>:   inc    %eax
0x0000000100000f14 <main+20>:   cmp    $0xf4241,%eax
0x0000000100000f19 <main+25>:   jne    0x100000f10 <main+16>
0x0000000100000f1b <main+27>:   lea    0x14(%rip),%rdi        # 0x100000f36
0x0000000100000f22 <main+34>:   xor    %eax,%eax
0x0000000100000f24 <main+36>:   leaveq 
0x0000000100000f25 <main+37>:   jmpq   0x100000f30 <dyld_stub_printf>

Now, I'm not much of an x86 assembly programmer, but that looks more or less perfect.

Okay, I have graduate school applications to work on. No more.
