
I have a Seq[String] that holds the lines read from stdin.

Those lines map to a model entity, but it is not guaranteed that 1 line = 1 entity instance.

Each entity is delimited with a special string that will not occur anywhere else in the input.

My solution was something like:

val entities = lines.mkString.split(myDelimiter).map(parseEntity)

The parseEntity implementation is not relevant: it takes a String and maps it to a case class that represents the model entity.

The problem is that, for a given input, I get an OutOfMemoryError on lines.mkString. Would a fold/foldLeft/foldRight be more efficient? Or is there a better alternative?

  • Just a small point to consider: even if you find a workaround for running out of memory during mkString, the same problem may recur in map(parseEntity), since the collection of all created entities will probably need about as much memory as the raw string. Commented Feb 23, 2017 at 17:42
  • Can you change the way you read your input so you read each entity into a string instead of each line? That would be the best way to improve this. Commented Feb 23, 2017 at 18:25
  • @puhlen No, I can't control the source... Commented Feb 24, 2017 at 10:43

2 Answers

You can solve this using Akka Streams and delimiter framing. See this section of the documentation for the basic approach.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Framing, Source}
import akka.util.ByteString

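// Sample input: the numbers 0 to 99 joined by "delimiter", cut into 8-character chunks
// so that chunk boundaries do not line up with entity boundaries
val example = (0 until 100).mkString("delimiter").grouped(8).toIndexedSeq
// allowTruncation = true emits the last entity even though no delimiter follows it;
// with the default (false) the stream fails at the end of this input
val framing = Framing.delimiter(ByteString("delimiter"), maximumFrameLength = 1000, allowTruncation = true)

implicit val system = ActorSystem()
implicit val mat = ActorMaterializer()

Source(example)
  .map(ByteString.apply)   // Framing works on ByteString, not String
  .via(framing)            // re-chunk the stream at each delimiter
  .map(_.utf8String)       // back to String, one element per entity
  .runForeach(println)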

The conversion to and from ByteString is a bit annoying, but Framing.delimiter is only defined for ByteString.

If you prefer a more purely functional approach, fs2 also offers primitives to solve this problem.

3 Comments

Could you elaborate a bit more, please? I'm not familiar with Akka Streams.
Thanks! Although I would prefer not to add a new dependency just for this problem, I will have a look.
fs2 doesn't have a built in way to split on an arbitrary delimiter.

This worked for me when reading from a stream (your mileage may vary). It is a slightly modified version of Scala's LineIterator:

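import scala.collection.AbstractIterator

// Splits a character stream into entities on a single-character delimiter
class EntityIterator(val iter: BufferedIterator[Char]) extends AbstractIterator[String] with Iterator[String] {
  private[this] val sb = new StringBuilder

  // Consumes one character; returns false at the delimiter or at end of input
  def getc() = iter.hasNext && {
    val ch = iter.next()
    if (ch == '\n') false // Replace with your delimiter here
    else {
      sb append ch
      true
    }
  }

  def hasNext = iter.hasNext

  // Collects characters up to the next delimiter and returns them as one entity
  def next() = {
    sb.clear()
    while (getc()) { }
    sb.toString
  }
}

// Source is itself an Iterator[Char], so it can be buffered directly
val entities =
  new EntityIterator(scala.io.Source.fromInputStream(...).buffered)

entities.map(...)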
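
The iterator above compares one character at a time, so it covers a single-character delimiter. For a delimiter that is a longer string, as in the question, the same idea can be sketched over an Iterator[String] of lines without any extra dependency. This is only a minimal sketch: splitEntities is a hypothetical helper name, and it assumes the lines are meant to be concatenated without a separator (as lines.mkString does) and that the delimiter is a literal string rather than a regex. Only the unfinished entity is kept in memory at any time.

```scala
// Hypothetical helper (not from the original answer): lazily splits the
// concatenation of `lines` on a literal, possibly multi-character delimiter.
def splitEntities(lines: Iterator[String], delimiter: String): Iterator[String] =
  new Iterator[String] {
    require(delimiter.nonEmpty, "delimiter must not be empty")

    private val buf = new java.lang.StringBuilder // holds only the unfinished entity
    private var searchFrom = 0                     // where the next delimiter search starts
    private var pending: Option[String] = None     // next entity, if already extracted
    private var done = false                       // true once input and buffer are exhausted

    // Fills `pending` with the next entity, reading more lines only when needed
    private def advance(): Unit =
      while (pending.isEmpty && !done) {
        val idx = buf.indexOf(delimiter, searchFrom)
        if (idx >= 0) {
          pending = Some(buf.substring(0, idx))
          buf.delete(0, idx + delimiter.length)
          searchFrom = 0
        } else if (lines.hasNext) {
          // A delimiter may straddle two lines, so resume slightly before the old end
          searchFrom = math.max(0, buf.length - delimiter.length + 1)
          buf.append(lines.next())
        } else {
          done = true
          // Emit whatever follows the last delimiter, if anything
          if (buf.length > 0) { pending = Some(buf.toString); buf.setLength(0) }
        }
      }

    def hasNext: Boolean = { advance(); pending.nonEmpty }

    def next(): String = {
      advance()
      val entity = pending.getOrElse(throw new NoSuchElementException("no more entities"))
      pending = None
      entity
    }
  }
```

With the names from the question, usage would look roughly like splitEntities(lines.iterator, myDelimiter).map(parseEntity). Consuming the result with foreach (or another streaming operation) rather than collecting it into a list avoids holding every parsed entity in memory at once, which is the concern raised in the comments on the question.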
