
I am new to both Scala and Spark. I am trying to convert data that is read from files as Double to Float (which is safe in this application) in order to reduce memory usage. I have been able to do this for a single Double column.

Current approach for a single element:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // needed for toDF when not in spark-shell

// UDF that narrows a Double to a Float
val tcast = udf((s: Double) => s.toFloat)

val myDF = Seq(
   (1.0, Array(0.1, 2.1, 1.2)),
   (8.0, Array(1.1, 2.1, 3.2)),
   (9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")

// Cast the scalar column, then swap it in place of the original
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").show
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").schema
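As an aside, the scalar cast can also be done without a UDF via the built-in Column.cast; a minimal sketch against the same myDF (noUdf is just an illustrative name):

// UDF-free alternative for the scalar column
val noUdf = myDF.withColumn("time", col("time").cast("float"))
noUdf.printSchema() // time is now float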

But I am currently stuck on transforming an array of doubles to floats. Any help would be appreciated.

1 Answer

You can use selectExpr, like this:

val myDF = Seq(
   (1.0, Array(0.1, 2.1, 1.2)),
   (8.0, Array(1.1, 2.1, 3.2)),
   (9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")

myDF.printSchema()

// output:
root
 |-- time: double (nullable = false)
 |-- crds: array (nullable = true)
 |    |-- element: double (containsNull = false)

val df = myDF.selectExpr("cast(time as float) time", "cast(crds as array<float>) as crds")
df.show()

+----+---------------+
|time|           crds|
+----+---------------+
| 1.0|[0.1, 2.1, 1.2]|
| 8.0|[1.1, 2.1, 3.2]|
| 9.0|[1.1, 1.1, 2.2]|
+----+---------------+

df.printSchema()

root
 |-- time: float (nullable = false)
 |-- crds: array (nullable = true)
 |    |-- element: float (containsNull = true)
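
The same conversion can also be expressed with the typed Column API instead of a SQL string; a minimal equivalent sketch, assuming the same myDF (df2 is just an illustrative name):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, FloatType}

// Column.cast accepts a DataType, so the array element type
// can be narrowed directly without parsing a SQL expression.
val df2 = myDF
  .withColumn("time", col("time").cast(FloatType))
  .withColumn("crds", col("crds").cast(ArrayType(FloatType)))

df2.printSchema() // time: float, crds: array with float elements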

2 Comments

Thanks for the quick reply but crds are still double!?
Thank you very much for the solution!
