I am looking for a way to optimize code like this:

// for each line do many string concatenations
myRdd.map{x => "some_text" + x._1 + "some_other_text" + x._4 + ...}

I just read that an interpolated string such as

s"some_text${x._1}..."

is compiled to plain string concatenation, just like the one in my map.

So my first thought was to use a StringBuilder like

   myRdd.map{x =>
    val sb = new StringBuilder()
    sb.append("some_text")
    sb.append(x._1)
    ...
    sb.toString
   }

But the StringBuilder object will be created for each line. Is there a best practice for this kind of optimization, such as declaring the StringBuilder somewhere else (in an object or as a class attribute) and always reusing the same instance in my map?

2 Answers

If you disassemble code like myRdd.map{x => "some_text" + x._1 + "some_other_text" + x._4 + ...} (compiled here against a Tuple2, hence _2 instead of _4), you will see something like this:

NEW java/lang/StringBuilder
DUP
LDC 24
INVOKESPECIAL java/lang/StringBuilder.<init> (I)V
LDC "some_text"
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/String;)Ljava/lang/StringBuilder;
ALOAD 0
INVOKEVIRTUAL scala/Tuple2._1 ()Ljava/lang/Object;
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/Object;)Ljava/lang/StringBuilder;
LDC "some_other_text"
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/String;)Ljava/lang/StringBuilder;
ALOAD 0
INVOKEVIRTUAL scala/Tuple2._2 ()Ljava/lang/Object;
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/Object;)Ljava/lang/StringBuilder;
INVOKEVIRTUAL java/lang/StringBuilder.toString ()Ljava/lang/String;

So, as you can see, the Scala compiler already turns string concatenation into StringBuilder calls, so there is no real difference between your first code snippet and your second (javac did the same before Java 9 moved to invokedynamic-based concatenation). The first solution is preferable (especially the version with string interpolation) because it is more readable.
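To make the comparison concrete, here is a minimal sketch of the three styles side by side. The object name ConcatSketch and the sample tuple are made up for illustration and stand in for one row of the RDD. All three build the same string, so the choice mostly comes down to readability.

```scala
object ConcatSketch {
  // Hypothetical stand-in for one row of myRdd
  val x = ("a", "b")

  // Plain concatenation, as in the question
  val viaPlus: String = "some_text" + x._1 + "some_other_text" + x._2

  // String interpolation
  val viaInterpolation: String = s"some_text${x._1}some_other_text${x._2}"

  // Explicit StringBuilder, spelled out by hand
  val viaBuilder: String = new StringBuilder()
    .append("some_text").append(x._1)
    .append("some_other_text").append(x._2)
    .toString
}
```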

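You could reuse your string builder by resetting it with the setLength method. Note that the values have to be appended one by one; concatenating them with + inside append would allocate a second, hidden builder anyway.

val sb = new StringBuilder()

myRdd.map{x =>
  sb.setLength(0)  // reset the builder instead of allocating a new one
  sb.append("some_text").append(x._1).append("some_other_text").append(x._2)
  sb.toString
}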

But I don't know if it's worth it. The string builders created in this loop are very short-lived, so they will most likely never leave the eden region of the heap and will be reclaimed cheaply by the next minor GC.
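If you do want to reuse a builder, one option is to scope it to a partition rather than capturing a single instance from the driver. As far as I understand, Spark serializes the closure for each task, so a captured builder is not truly shared anyway. The sketch below is only an illustration: ReuseBuilderSketch and formatPartition are hypothetical names, the rows are assumed to be pairs of strings, and a plain Iterator stands in for one partition so the code runs without Spark.

```scala
object ReuseBuilderSketch {
  // Formats all rows of one partition with a single, reused builder.
  def formatPartition(rows: Iterator[(String, String)]): Iterator[String] = {
    val sb = new StringBuilder() // one instance per partition, not per row
    rows.map { x =>
      sb.setLength(0)            // reset instead of reallocating
      sb.append("some_text").append(x._1)
      sb.append("some_other_text").append(x._2)
      sb.toString                // copies the current contents into a new String
    }
  }

  // A plain Iterator plays the role of one partition here.
  val formatted: List[String] =
    formatPartition(Iterator(("a", "b"), ("c", "d"))).toList
}
```

With a real RDD of pairs, the same function could be passed as myRdd.mapPartitions(ReuseBuilderSketch.formatPartition), since mapPartitions takes a function from Iterator to Iterator.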

The downside of the StringBuilder approach is that it is a lot less readable, not functional, and ugly. If this fragment is not causing serious performance problems, I would stay with string interpolation. Remember: premature optimization is the root of all evil.

Rather than using a global, mutable StringBuilder, consider storing the text fragments in a List keyed by the (1-based) tuple index and concatenating them with foldLeft, as shown below:

val rdd = sc.parallelize(Seq(
  ("a", "b", "c", "d", "e"),
  ("f", "g", "h", "i", "j")
))

val textList = List((1, "x1"), (3, "x3"), (4, "x4"))

rdd.map( r => textList.foldLeft("")( (acc, kv) =>
  acc + kv._2 + r.productElement(kv._1 - 1)
) ).
collect
// res1: Array[String] = Array(x1ax3cx4d, x1fx3hx4i)
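A possible variant, sketched here without Spark so that it runs on its own, folds into a local StringBuilder instead of an immutable String. The mutation then stays confined to a single call. The names FoldSketch and render are hypothetical and not part of the answer above; textList is the same list as before.

```scala
object FoldSketch {
  // Same (index, text) pairs as above; indices are 1-based
  val textList = List((1, "x1"), (3, "x3"), (4, "x4"))

  // Builds the string for one row; works for any tuple, since tuples are Products.
  def render(r: Product): String =
    textList.foldLeft(new StringBuilder) { (sb, kv) =>
      sb.append(kv._2).append(r.productElement(kv._1 - 1))
    }.toString

  val rendered: String = render(("a", "b", "c", "d", "e"))
}
```

On an RDD this would be used as rdd.map(FoldSketch.render). Whether it is measurably faster than the String version is something I have not benchmarked.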
