Optimising Haskell data reading from file

Question

I am trying to implement Kosaraju's graph algorithm on a 3.5 million-line file in which each row is two space-separated Ints representing a graph edge. To start, I need to create a summary data structure that holds each node together with lists of its incoming and outgoing edges. The code below achieves that but takes over a minute, whereas posts on the MOOC forum show people using other languages finishing in well under 10s. (getLines alone takes 10s, compared to under 1s in benchmarks I have read about.)

I'm new to Haskell and have implemented an accumulation method using foldl' (the ' was a breakthrough in making it terminate at all), but it feels rather imperative in style, and I'm hoping that that's the reason why it is running slowly. Moreover, I'm planning to use a similar pattern for the depth-first search, and I fear it will all just become too slow.

I have found this presentation and this blog post that discuss these sorts of issues, but at too expert a level for me.

import System.IO
import Control.Monad
import Data.Map.Strict as Map
import Data.List as L

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored (Edges, Edges) deriving (Show)

type Graph1 = Map NodeName Node

getLines :: FilePath -> IO [[Int]]
getLines = liftM (fmap (fmap read . words) . lines) . readFile

getLines' :: FilePath -> IO [(Int,Int)]
getLines' = liftM (fmap (tuplify2 . fmap read . words) . lines) . readFile

tuplify2 :: [a] -> (a,a)
tuplify2 [x,y] = (x,y)

main = do
    list <- getLines "testdata.txt"  -- [[Int]]
    --list <- getLines "SCC.txt"  -- [[Int]]
    let
        list' = createGraph list
    return list'

createGraph :: [[Int]] -> Graph1
createGraph xs = L.foldl' build Map.empty xs
    where
        build :: Graph1-> [Int] -> Graph1
        build = \acc (x:y:_) ->
            let tmpAcc = case Map.lookup x acc of
                Nothing -> Map.insert x (Node False ([y],[])) acc
                Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False ((y:fwd), bck))) x acc
            in case Map.lookup y tmpAcc of
                Nothing -> Map.insert y (Node False ([],[x])) tmpAcc
                Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False (fwd, (x:bck)))) y tmpAcc

You should consider using an array. Also, a list of exactly two elements should immediately smell fishy.
– n. m. could be an AI, Commented Jun 18, 2014 at 7:15
Will look at Arrays. The two element list comes straight from the source file and represents an edge.
– Simon H, Commented Jun 18, 2014 at 7:21
There are other data structures beside the list. If you have exactly two of anything, list is not the first choice.
– n. m. could be an AI, Commented Jun 18, 2014 at 7:24

Community · Accepted Answer · 2017-05-23 11:43:52Z

10

Using maps:

Use IntMap or HashMap when possible. Both are significantly faster for Int keys than Map. HashMap is usually faster than IntMap but uses more RAM and has a less rich library.
Don't do unnecessary lookups. The containers package has a large number of specialized functions. With alter the number of lookups can be halved compared to the createGraph implementation in the question.

Example for createGraph:

import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node

createGraph :: [(Int, Int)] -> Graph1
createGraph xs = foldl' build IM.empty xs
    where
        addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
        addFwd y _                   = Just (Node False [y] [])
        addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
        addBwd x _                   = Just (Node False [] [x])

        build :: Graph1 -> (Int, Int) -> Graph1
        build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc

Using vectors:

Consider the efficient construction functions (the accumulators, unfolds, generate, iterate, constructN, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors.
In the more general case, use the laziness of boxed vectors to enable self-reference when constructing a vector.
Use unboxed vectors when possible.
Use unsafe functions when you're absolutely sure about the bounds.
Only use mutable vectors when there aren't pure alternatives. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i.e. prefer mutable vectors to immutable vectors of mutable references).
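To make the advice above a little more concrete, here is a minimal sketch (not benchmarked) that counts out-degrees into an unboxed vector, once with a pure accumulator and once with a mutable vector in ST. The names outDegrees and outDegreesST are made up for illustration, and both assume node labels lie in [0, n).

```haskell
import Control.Monad (forM_)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

-- Out-degree of every node, assuming node labels lie in [0, n).
-- Pure version: U.accum applies (+) at each listed index.
outDegrees :: Int -> [(Int, Int)] -> U.Vector Int
outDegrees n edges = U.accum (+) (U.replicate n 0) [(x, 1) | (x, _) <- edges]

-- Same result with an explicit mutable vector. U.create runs the ST
-- action and freezes the resulting vector.
outDegreesST :: Int -> [(Int, Int)] -> U.Vector Int
outDegreesST n edges = U.create $ do
    counts <- MU.replicate n 0
    forM_ edges $ \(x, _) -> MU.modify counts (+ 1) x
    return counts
```

The pure version is usually the one to reach for first; the ST version only becomes worthwhile when the update pattern does not fit one of the accumulator functions.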

Example for createGraph:

import qualified Data.Vector as V

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = V.Vector Node

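createGraph :: Int -> [(Int, Int)] -> Graph1
createGraph maxIndex edges = graph'' where
    -- Valid indices run from 0 to maxIndex inclusive, hence maxIndex + 1 slots.
    graph    = V.replicate (maxIndex + 1) (Node False [] [])
    -- First pass: record y as an outgoing neighbour of x for every edge (x, y).
    graph'   = V.accum (\(Node e f b) x -> Node e (x:f) b) graph  edges
    -- Second pass: flip each edge to record x as an incoming neighbour of y.
    graph''  = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)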

Note that if there are gaps in the range of the node indices, then it'd be wise to either

Contiguously relabel the indices before doing anything else.
Introduce an empty constructor to Node to signify a missing index.
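For the first option, one possible way to relabel is sketched below. The function name relabel is hypothetical; it ranks the distinct labels in ascending order and rewrites every edge in terms of those ranks.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.IntSet as IS

-- Map each distinct node label to its rank among the sorted labels,
-- giving contiguous indices 0 .. n-1. Returns n and the relabelled edges.
relabel :: [(Int, Int)] -> (Int, [(Int, Int)])
relabel edges = (IM.size table, [(table IM.! x, table IM.! y) | (x, y) <- edges])
  where
    labels = IS.fromList (concat [[x, y] | (x, y) <- edges])
    table  = IM.fromDistinctAscList (zip (IS.toAscList labels) [0 ..])
```

For instance, relabel [(1,5),(5,9),(9,1)] gives (3, [(0,1),(1,2),(2,0)]). The first component is the number of distinct nodes, which is the vector length needed afterwards. Keep the table around if the original labels have to be reported at the end.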

Faster I/O:

Use the IO functions from Data.Text or Data.ByteString. In both cases there are also efficient functions for breaking input into lines or words.

Example:

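import qualified Data.ByteString.Char8 as BS

getLines :: FilePath -> IO [(Int, Int)]
getLines path = do
    -- Split the file into lines, then each line into whitespace-separated words.
    rows <- (map BS.words . BS.lines) `fmap` BS.readFile path
    -- BS.readInt returns the parsed Int and the unconsumed remainder; keep the Int.
    let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) rows
    -- Rows that do not contain exactly two Ints are silently dropped.
    return [(a, b) | [a, b] <- pairs]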

Benchmarking:

Always do it, unlike me in this answer. Use criterion.
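As a starting point, a criterion benchmark could look roughly like the sketch below. The function outDegree and the synthetic sampleEdges input are stand-ins invented for this example; in practice you would benchmark createGraph on edges parsed from the real file.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, whnf)
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Stand-in for the function under test: out-degree per node.
outDegree :: [(Int, Int)] -> IM.IntMap Int
outDegree = foldl' (\m (x, _) -> IM.insertWith (+) x 1 m) IM.empty

-- Synthetic input (hypothetical; substitute edges parsed from the real file).
sampleEdges :: Int -> [(Int, Int)]
sampleEdges n = [(i `mod` 1000, (i * 7) `mod` 1000) | i <- [1 .. n]]

-- whnf applies the function to the argument on each run and forces the
-- result to weak head normal form.
runBenchmarks :: IO ()
runBenchmarks = defaultMain
    [ bgroup "outDegree"
        [ bench "10k edges"  $ whnf outDegree (sampleEdges 10000)
        , bench "100k edges" $ whnf outDegree (sampleEdges 100000)
        ]
    ]
```

Set main = runBenchmarks and compile with -O2. For code that performs IO, such as reading the file, use whnfIO or nfIO instead of whnf, otherwise the action is only constructed and never run.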

edited May 23, 2017 at 11:43

Community Bot

answered Jun 18, 2014 at 9:57

András Kovács



4 Comments

Simon H Over a year ago

This is awesome and I will work through it progressively. Learning Criterion is probably too much in the short run, but I tried your IntMap solution and it reduced runtime from 113s to 100s (and your code also includes the tuplify, which adds some time relative to my benchmark). More to follow.

Carl Over a year ago

@SimonH1000 Criterion is actually really easy. The simplest uses of it are importing Criterion.Main then using defaultMain, bench, and whichever of the set nf, whnf, nfIO, or whnfIO you need.

jberryman Over a year ago

Thirding criterion; even if you just have a simple test that runs your createGraph on some pre-defined input, being able to see the distribution of runs, and having accurate measurements (as opposed to using time) will save you a ton of time and headaches. And once you've got your first test set up, it's really easy to add other bits of your program in there to make sure you have an accurate view of your code's performance.

Simon H Over a year ago

I've added some Criterion code in my answer (but not the change to main). I get a bunch of numbers out but can't find a simple run time

Simon H · Accepted Answer · 2014-06-19 12:56:07Z

4

Based pretty much on András' suggestions, I've reduced a 113-second task to 24 seconds (measured by stopwatch, as I can't quite get Criterion to do anything yet), and then to 10 seconds by compiling with -O2! I've attended some courses this last year that talked about the challenge of optimising for large datasets, but this was the first time I faced a question that actually involved one, and it was as non-trivial as my instructors suggested. This is what I have now:

import System.IO
import Control.Monad
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
import qualified Data.ByteString.Char8 as BS

type NodeName = Int
type Edges = [NodeName]
type Explored = Bool

data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node

-- DFS uses a stack to store next points to explore, a list can do this
type Stack = [(NodeName, NodeName)]

getBytes :: FilePath -> IO [(Int, Int)]
getBytes path = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
    let
        pairs = (map . map) (maybe (error "Can't read integers") fst . BS.readInt) lines
    return [(a,b) | [a,b] <- pairs]

main = do
    --list <- getLines' "testdata.txt"  -- [(Int, Int)]
    list <- getBytes "SCC.txt"  -- [(Int, Int)]
    let list' = createGraph' list
    print $ list' IM.! 66
    -- return list'


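-- Requires: import Criterion.Main (defaultMain, bgroup, bench, whnfIO)
bmark = defaultMain [
    bgroup "1" [
        -- whnfIO runs the IO action; whnf would only build it without running it.
        bench "Sim test" $ whnfIO (bmark' "SCC.txt")
        ]
    ]

bmark' :: FilePath -> IO ()
bmark' path = do
    list <- getBytes path
    let
        list' = createGraph' list
    print $ list' IM.! 2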


createGraph' :: [(Int, Int)] -> Graph1
createGraph' xs = foldl' build IM.empty xs
    where
        addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
        addFwd y _                   = Just (Node False [y] [])
        addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
        addBwd x _                   = Just (Node False [] [x])

        build :: Graph1 -> (Int, Int) -> Graph1
        build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc

And now on with the rest of the exercise....

edited Jun 19, 2014 at 12:56

answered Jun 18, 2014 at 20:07

Simon H


8 Comments

András Kovács Over a year ago

Good job! By the way, I looked at your SCC.txt, and it actually has a contiguous range of nodes, with only the node "0" missing. So I could use pretty much the same vector code that I delineated. Here's a gist for it. Also, it runs in 4,7 secs on my computer.

András Kovács Over a year ago

Also, did you compile with optimizations (-O2, possibly also -fllvm)? I also ran the code you just posted here and it finished in 6,3 secs for me (or maybe you have a slower computer... I have Core i7 3770 CPU).

Simon H Over a year ago

WOW - I was a simple Prelude user until 5 mins ago - now down to 10s after first compiling with ghc -O2 <filename>

Simon H Over a year ago

@AndrásKovács Regrettably there is no node 37, for example, so Vectors are out, without a renumbering at least.

Piezoid Over a year ago

You can store an absent node as a node without inbound or outbound edges. You can share a single empty node between all the missing nodes, so that each one only costs you the space for a pointer in the vector.


Piezoid · Accepted Answer · 2014-06-18 12:15:17Z

3

This is not really an answer; I would rather comment on András Kovács's post, if I had those 50 points...

I have implemented the loading of the graph in both IntMap and MVector, in an attempt to benchmark mutability vs. immutability.

Both programs use Attoparsec for the parsing. There is surely a more economical way to do it, but Attoparsec is relatively fast given its high level of abstraction (the parser fits on one line). The guideline is to avoid String and read: read is partial and slow, and [Char] is slow and memory-inefficient unless properly fused.
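To give an idea of what such a parser can look like, here is a small sketch. It is not the code from the gist; the names edge, edges and parseEdges are made up, and it assumes non-negative decimal labels separated by spaces, one edge per line.

```haskell
import Data.Attoparsec.ByteString.Char8
    (Parser, decimal, endOfInput, endOfLine, parseOnly, sepBy, skipSpace, skipWhile)
import qualified Data.ByteString.Char8 as BS

-- One edge: two decimal numbers separated by spaces, tolerating
-- trailing spaces before the end of the line.
edge :: Parser (Int, Int)
edge = (,) <$> decimal <* skipWhile (== ' ') <*> decimal <* skipWhile (== ' ')

-- All edges, one per line, with optional trailing whitespace at the end.
edges :: Parser [(Int, Int)]
edges = edge `sepBy` endOfLine <* skipSpace <* endOfInput

parseEdges :: BS.ByteString -> Either String [(Int, Int)]
parseEdges = parseOnly edges
```

Something like parseEdges `fmap` BS.readFile path then yields either a parse error message or the edge list, instead of failing with an exception the way read does.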

As András Kovács noted, IntMap is better than Map for Int keys. My code provides another example of alter usage. If the node identifier mapping is dense, you may also want to use Vector or Array. They allow O(1) indexing by the identifier.

The mutable version grows the MVector exponentially on demand. This avoids having to specify an upper bound on node identifiers, but introduces more complexity (the reference to the vector may change).
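The growth strategy can be sketched as follows. This is a simplified illustration rather than the gist code: it only counts out-degrees, the names ensureIndex and outDegreesGrowing are hypothetical, and node labels are assumed to be non-negative.

```haskell
import Control.Monad (foldM)
import Control.Monad.ST (ST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

-- Return a vector in which index i is valid, at least doubling the
-- capacity when it is not. MU.grow may allocate a new vector, so
-- callers must keep using the returned reference.
ensureIndex :: MU.MVector s Int -> Int -> ST s (MU.MVector s Int)
ensureIndex v i
    | i < n     = return v
    | otherwise = do
        let extra = max (i + 1) (2 * n) - n
        v' <- MU.grow v extra
        MU.set (MU.slice n extra v') 0   -- zero the newly added slots
        return v'
  where
    n = MU.length v

-- Count out-degrees without knowing the largest label in advance.
outDegreesGrowing :: [(Int, Int)] -> U.Vector Int
outDegreesGrowing edges = U.create $ do
    v0 <- MU.replicate 16 0
    foldM step v0 edges
  where
    step v (x, _) = do
        v' <- ensureIndex v x
        MU.modify v' (+ 1) x
        return v'
```

The vector is threaded through foldM precisely because it may be replaced on growth. The result is padded with zeros up to the final capacity, so its length is not the number of nodes.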

I benchmarked with a file of 5M edges with identifiers in the range [0..2^16]. The MVector version is ~2x faster than the IntMap code (12s vs 25s on my computer).

The code is here [Gist].

I will edit when more profiling is done on my side.

answered Jun 18, 2014 at 12:15

Piezoid


1 Comment

Piezoid Over a year ago

There is something wrong with my parser: it accumulates lots of ByteStrings. I tried to make it stricter, but it is still the same.
