0

here is my code:(get file line num and word count)

import System.IO
import Data.Maybe
readL::(Int,Int,Int)->IO()
readL (w,l,-1) = do
                putStrLn $ "word:" ++(show w )++"\nline:"++(show l)
readL (w,l,0) = do 
                s<-hIsEOF stdin
                if s 
                        then readL (w,l,-1)
                        else 
                                do
                                f<-getLine
                                readL (w+length f,l+1,0)

main = do
        hSetBinaryMode stdin True
        readL (0,0,0)

when I process a file with size 100m,it just crashes,with error: Stack space overflow: current size 8388608 bytes

Is there something I wrote wrong?

I also have another version here:

import System.IO
import Data.List
main = do
        hSetBinaryMode stdin True
        interact $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n"). foldl' (\(w,l) r-> (w + length r,l+1) ) (0,0)   .lines

this have the same problem too... and with lots of memory,so,anybody can slove this?I'm just a new learner in haskell.

1 Answer 1

2

The problem is that neither the w nor the l parameter to readL are evaluated before the end of input is reached. So for an input with many lines, you build huge thunks (((0 + length line1) + length line2) ... + length lastline), similar for l, and for more than half a million lines or so, evaluating that thunk will not fit in the available stack. Additionally, the length f holds on to the line read until it is evaluated, causing unnecessarily large memory use.

You have to keep the accumulating parameters evaluated in the loop, the easiest way is with bang-patterns

readL !(!w,!l,-1) = ...

or a seq:

readL (w,l,c)
    | w `seq` l `seq` (c == -1) = putStrLn $ "word:" ++(show w )++"\nline:"++(show l)
readL (w,l,0) = do 
                s<-hIsEOF stdin
                if s 
                        then readL (w,l,-1)
                        else 
                                do
                                f<-getLine
                                readL (w+length f,l+1,0)

The foldl' version has the same problem,

foldl' (\(w,l) r-> (w + length r,l+1) ) (0,0)

only evaluates the accumulator pair to weak head normal form, that is to the outermost constructor, here (,). It does not force evaluation of the components. To do that, you can

  • use a strict pair type for the fold

    data P = P !Int !Int
    
    foo = ... . foldl' (\(P w l) r -> P (w + length r) (l+1)) (P 0 0) . lines
    
  • or use seq in the folded function

    ... . foldl' (\(w,l) r -> w `seq` l `seq` (w + length r, l+1)) . lines
    
Sign up to request clarification or add additional context in comments.

4 Comments

I added a few words on why the foldl' doesn't help.
Final question,is foldl' version slower than the loop version? I mean,the foldl version will split all lines into memory,and then process it,or, because of the lazy feature,just like pipe in linux,it will act like the loop version?
The foldl' version will take the lines as they come, if the w parameter is kept evaluated so that nothing references old lines anymore, they can be garbage collected near immediately, so it doesn't need the entire input in memory at once. It starts processing before the entire input is available. The performance difference, if any, will be very small.
I made some tests.foldl version is faster than foldl' and faster than the loop version,so I guess that would because of lazy

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.