17

The Go code below reads in a 10,000 record CSV (of timestamp times and float values), runs some operations on the data, and then writes the original values to another CSV along with an additional column for score. However it is terribly slow (i.e. hours, but most of that is calculateStuff()) and I'm curious if there are any inefficiencies in the CSV reading/writing I can take care of.

package main

import (
  "encoding/csv"
  "log"
  "os"
  "strconv"
)

func ReadCSV(filepath string) ([][]string, error) {
  csvfile, err := os.Open(filepath)

  if err != nil {
    return nil, err
  }
  defer csvfile.Close()

  reader := csv.NewReader(csvfile)
  fields, err := reader.ReadAll()

  return fields, nil
}

func main() {
  // load data csv
  records, err := ReadCSV("./path/to/datafile.csv")
  if err != nil {
    log.Fatal(err)
  }

  // write results to a new csv
  outfile, err := os.Create("./where/to/write/resultsfile.csv"))
  if err != nil {
    log.Fatal("Unable to open output")
  }
  defer outfile.Close()
  writer := csv.NewWriter(outfile)

  for i, record := range records {
    time := record[0]
    value := record[1]

    // skip header row
    if i == 0 {
      writer.Write([]string{time, value, "score"})
      continue
    }

    // get float values
    floatValue, err := strconv.ParseFloat(value, 64)
    if err != nil {
      log.Fatal("Record: %v, Error: %v", floatValue, err)
    }

    // calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
    score := calculateStuff(floatValue)

    valueString := strconv.FormatFloat(floatValue, 'f', 8, 64)
    scoreString := strconv.FormatFloat(prob, 'f', 8, 64)
    //fmt.Printf("Result: %v\n", []string{time, valueString, scoreString})

    writer.Write([]string{time, valueString, scoreString})
  }

  writer.Flush()
}

I'm looking for help making this CSV read/write template code as fast as possible. For the scope of this question we need not worry about the calculateStuff method.

4
  • 1
    How slow is 'terribly slow'? What is the file size of the csv file you are testing? Because the word 'slow' is subjective... Commented Aug 15, 2015 at 17:58
  • 2
    Assuming the calculation is dependant only on the current record (e.g. you don't need to total all values of a column, or sort, or somesuch) don't suck in the whole file into memory but instead operate on each record as it it's read, the write it out. Avoid things like ioutil.ReadAll when you can instead operate on a stream. In any case, unless you're swapping (due to using too much memory) you're likely IO bound. Commented Aug 15, 2015 at 18:04
  • 1
    Also, make sure to check for errors!! E.g. play.golang.org/p/CBzz4lImW2 Commented Aug 15, 2015 at 18:17
  • Also, if this the entire program (i.e. if it's not a part of a larger program) it makes more sense not to muck with file names but just read from os.Stdin and write to os.Stdout and let the shell handle opening files (or using direct output/input from/to other programs!) for you. Commented Aug 15, 2015 at 18:25

3 Answers 3

26

You're loading the file in memory first then processing it, that can be slow with a big file.

You need to loop and call .Read and process one line at a time.

func processCSV(rc io.Reader) (ch chan []string) {
    ch = make(chan []string, 10)
    go func() {
        r := csv.NewReader(rc)
        if _, err := r.Read(); err != nil { //read header
            log.Fatal(err)
        }
        defer close(ch)
        for {
            rec, err := r.Read()
            if err != nil {
                if err == io.EOF {
                    break
                }
                log.Fatal(err)

            }
            ch <- rec
        }
    }()
    return
}

playground

//note it's roughly based on DaveC's comment.

Sign up to request clarification or add additional context in comments.

Comments

9

This is essentially Dave C's answer from the comments sections:

package main

import (
  "encoding/csv"
  "log"
  "os"
  "strconv"
)

func main() {
  // setup reader
  csvIn, err := os.Open("./path/to/datafile.csv")
  if err != nil {
    log.Fatal(err)
  }
  r := csv.NewReader(csvIn)

  // setup writer
  csvOut, err := os.Create("./where/to/write/resultsfile.csv"))
  if err != nil {
    log.Fatal("Unable to open output")
  }
  w := csv.NewWriter(csvOut)
  defer csvOut.Close()

  // handle header
  rec, err := r.Read()
  if err != nil {
    log.Fatal(err)
  }
  rec = append(rec, "score")
  if err = w.Write(rec); err != nil {
    log.Fatal(err)
  }

  for {
    rec, err = r.Read()
    if err != nil {
      if err == io.EOF {
        break
      }
      log.Fatal(err)
    }

    // get float value
    value := rec[1]
    floatValue, err := strconv.ParseFloat(value, 64)
    if err != nil {
      log.Fatal("Record, error: %v, %v", value, err)
    }

    // calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
    score := calculateStuff(floatValue)

    scoreString := strconv.FormatFloat(score, 'f', 8, 64)
    rec = append(rec, scoreString)

    if err = w.Write(rec); err != nil {
      log.Fatal(err)
    }
  w.Flush()
  }
}

Note of course the logic is all jammed into main(), better would be to split it into several functions, but that's beyond the scope of this question.

Comments

2

encoding/csv is indeed very slow on big files, as it performs a lot of allocations. Since your format is so simple I recommend using strings.Split instead which is much faster.

If even that is not fast enough you can consider implementing the parsing yourself using strings.IndexByte which is implemented in assembly: http://golang.org/src/strings/strings_decl.go?s=274:310#L1

Having said that, you should also reconsider using ReadAll if the file is larger than your memory.

1 Comment

A faster implementation using bufio.Scanner and bytes.LastIndex might look something like this. It's ~3x faster than encoding/csv but that's at the expense of a lot of the error checking and flexibility that encoding/csv provides.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.