
I have thousands of files (around 50K), each with roughly 10K lines. I read a file, do some processing, and write the lines back to an output file. While my reading and processing are fast, the final step of converting the String Iterator back to a single String and writing it to a file takes a long time (almost a second per file; I won't do the math for the whole population of 50K files). I see this as the bottleneck in improving my parsing time.

This is my code.

var processedLines = linesFromGzip(new File(fileName)).map(line => MyFunction(line))
var outFile = Resource.fromFile(outFileName)

outFile.write(processedLines.mkString("\n"))  // severe overhead caused by processedLines.mkString("\n")

(I read on a few other forums/blogs that mkString is much better than other approaches.)

Is there a better alternative to mkString("\n")? Is there a totally different approach that would increase my speed of processing files? (Remember, I have 50K files, each close to 10K lines.)

2 Answers


Well, you are repeating the operation twice: once to iterate over the lines in mkString("\n"), and again when writing the resulting String to a file. Instead you could do it in one pass:

for (x <- processedLines) {
    outFile.write(x)
    outFile.write("\n")
}

11 Comments

@Learner I guess in processedLines.mkString("\n"), where you iterate over processedLines to append \n, and then a second time while writing the newly generated long String to the file.
What type is outFile? I mean, is it a BufferedWriter or something else? If it doesn't buffer, then that takes extra clock ticks.
Well, I am not aware of the Resource class. But normally in such cases the bottleneck should be I/O. If it's CPU, then there is an issue. Try once with BufferedWriter and compare.
@Learner You might want to try outFile.writeStrings instead: outFile.writeStrings(processedLines, "\n") (jesseeichar.github.io/scala-io-doc/0.3.0/api/scalax/io/…). But I still think BufferedWriter should be the fastest way out.
@Learner Do for(x <-processedLines){ bufferedWriter.write(x); bufferedWriter.write("\n"); }
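Putting the comments above together, here is a minimal self-contained sketch of the BufferedWriter approach using plain java.io rather than scala-io (writeLines and outPath are illustrative names, not part of any library):

```scala
import java.io.{BufferedWriter, FileWriter}

// Write already-processed lines through a BufferedWriter.
// The default 8 KB buffer batches the many small write calls
// into far fewer system calls, so no giant String is ever built.
def writeLines(lines: Iterator[String], outPath: String): Unit = {
  val writer = new BufferedWriter(new FileWriter(outPath))
  try {
    for (line <- lines) {
      writer.write(line)
      writer.write("\n")
    }
  } finally {
    writer.close() // close() flushes any remaining buffered output
  }
}
```

The key point is that the Iterator is consumed element by element and each line goes straight into the buffer, so memory usage stays constant regardless of file size.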

Your writing is slower because you are using an Iterator, and Iterators are lazily evaluated: an element is computed only at the moment it is used. So it is not really the writing that is slow, but the evaluation of the Iterator. Because you mapped the elements of your Iterator, map yields a new Iterator that has not been evaluated yet; evaluation happens when you call mkString, which materializes the whole Iterator into a single String held in RAM. To avoid this, I recommend using a write approach that consumes the Iterator directly, as Jatin suggests. You could rewrite his code like:

processedLines.foreach { line =>
  outFile.write(line)
  outFile.write("\n")
}

This is a single pass over the Iterator: it evaluates one line at a time and writes it immediately.
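The laziness described above can be observed directly. In this small demonstration (the evaluated counter exists only for illustration), mapping over an Iterator does no work until mkString forces it:

```scala
// Count how many elements the mapped function has actually processed
var evaluated = 0
val it = Iterator("a", "b", "c").map { s => evaluated += 1; s.toUpperCase }

// Nothing has run yet: map on an Iterator is lazy
assert(evaluated == 0)

// mkString forces the Iterator, evaluating each element in turn
val result = it.mkString("\n")
assert(evaluated == 3)
assert(result == "A\nB\nC")
```

This is why the cost the question attributes to mkString is really the deferred cost of the whole mapping pipeline, paid all at once.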

5 Comments

Thanks for the explanation @T.Grotker. This process is a bit better than the earlier approaches, but not much better, I would say.
Indeed, you only gain the performance of traversing the Iterator (or String) once instead of twice.
Yes, I would say. Would changing the read approach enhance write performance?
Yes and no. You would just transfer the calculation from mkString to another point of your program. This is the best solution I can come up with at the moment.
Can you elaborate or give some links on what you mean? Thanks!
