how to process large file without stack overflow error in haskell?

Question

here is my code:(get file line num and word count)

import System.IO
import Data.Maybe
readL::(Int,Int,Int)->IO()
readL (w,l,-1) = do
                putStrLn $ "word:" ++(show w )++"\nline:"++(show l)
readL (w,l,0) = do 
                s<-hIsEOF stdin
                if s 
                        then readL (w,l,-1)
                        else 
                                do
                                f<-getLine
                                readL (w+length f,l+1,0)

main = do
        hSetBinaryMode stdin True
        readL (0,0,0)

when I process a file with size 100m,it just crashes,with error: Stack space overflow: current size 8388608 bytes

Is there something I wrote wrong?

I also have another version here:

import System.IO
import Data.List
main = do
        hSetBinaryMode stdin True
        interact $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n"). foldl' (\(w,l) r-> (w + length r,l+1) ) (0,0)   .lines

this have the same problem too... and with lots of memory,so,anybody can slove this?I'm just a new learner in haskell.

Daniel Fischer · Accepted Answer · 2012-04-01 13:48:53Z

2

The problem is that neither the w nor the l parameter to readL are evaluated before the end of input is reached. So for an input with many lines, you build huge thunks (((0 + length line1) + length line2) ... + length lastline), similar for l, and for more than half a million lines or so, evaluating that thunk will not fit in the available stack. Additionally, the length f holds on to the line read until it is evaluated, causing unnecessarily large memory use.

You have to keep the accumulating parameters evaluated in the loop, the easiest way is with bang-patterns

readL !(!w,!l,-1) = ...

or a seq:

readL (w,l,c)
    | w `seq` l `seq` (c == -1) = putStrLn $ "word:" ++(show w )++"\nline:"++(show l)
readL (w,l,0) = do 
                s<-hIsEOF stdin
                if s 
                        then readL (w,l,-1)
                        else 
                                do
                                f<-getLine
                                readL (w+length f,l+1,0)

The foldl' version has the same problem,

foldl' (\(w,l) r-> (w + length r,l+1) ) (0,0)

only evaluates the accumulator pair to weak head normal form, that is to the outermost constructor, here (,). It does not force evaluation of the components. To do that, you can

use a strict pair type for the fold

data P = P !Int !Int

foo = ... . foldl' (\(P w l) r -> P (w + length r) (l+1)) (P 0 0) . lines

or use seq in the folded function

... . foldl' (\(w,l) r -> w `seq` l `seq` (w + length r, l+1)) . lines

edited Apr 1, 2012 at 13:48

answered Apr 1, 2012 at 13:27

Daniel Fischer

184k19 gold badges319 silver badges436 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Daniel Fischer Over a year ago

I added a few words on why the foldl' doesn't help.

vzex Over a year ago

Final question,is foldl' version slower than the loop version? I mean,the foldl version will split all lines into memory,and then process it,or, because of the lazy feature,just like pipe in linux,it will act like the loop version?

Daniel Fischer Over a year ago

The foldl' version will take the lines as they come, if the w parameter is kept evaluated so that nothing references old lines anymore, they can be garbage collected near immediately, so it doesn't need the entire input in memory at once. It starts processing before the entire input is available. The performance difference, if any, will be very small.

vzex Over a year ago

I made some tests.foldl version is faster than foldl' and faster than the loop version,so I guess that would because of lazy

Collectives™ on Stack Overflow

how to process large file without stack overflow error in haskell?

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related