Scala Spark: splitting dataframe column dynamically

Question

I am very new to Scala and Spark.

I have read a text file into a DataFrame and successfully split its single column into several columns (the file is essentially a space-delimited CSV):

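  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.split
  import spark.implicits._  // enables the $"..." column syntax

  val irisDF: DataFrame = spark.read.csv("src/test/resources/iris-in.txt")

  irisDF.show()

  // Split the single column _c0 on spaces, then pick each item out of the resulting array.
  // The select keeps only the named items, so _tmp does not need to be dropped afterwards.
  val dfnew: DataFrame = irisDF.withColumn("_tmp", split($"_c0", " ")).select(
    $"_tmp".getItem(0).as("col1"),
    $"_tmp".getItem(1).as("col2"),
    $"_tmp".getItem(2).as("col3"),
    $"_tmp".getItem(3).as("col4")
  )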

This works.

But what if I do not know how many columns the data file contains? How do I generate the columns dynamically, based on the number of items produced by the split function?

akuiper · Accepted Answer · 2017-08-30 23:44:37Z

5

You can create a sequence of select expressions and then pass all of them to the select method with the :_* syntax, which expands a sequence into individual arguments:

Example Data:

val df = Seq("a b c d", "e f g").toDF("c0")

df.show
+-------+
|     c0|
+-------+
|a b c d|
|  e f g|
+-------+

If you want five columns from the c0 column (the number of columns has to be determined before this step):

val selectExprs = 0 until 5 map (i => $"temp".getItem(i).as(s"col$i"))

df.withColumn("temp", split($"c0", " ")).select(selectExprs:_*).show
+----+----+----+----+----+
|col0|col1|col2|col3|col4|
+----+----+----+----+----+
|   a|   b|   c|   d|null|
|   e|   f|   g|null|null|
+----+----+----+----+----+
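
If the number of columns is not known up front, one option is to compute it from the data first. The following is a minimal sketch, not part of the original answer: it assumes the column count should equal the largest number of items that split produces in any row, and it costs one extra pass over the data to find that number. The object name DynamicSplit and the helper splitDynamically are hypothetical names chosen for illustration. Here "temp" is only a scratch column that holds the array returned by split; each select expression then pulls one item out of that array.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, size, split}

object DynamicSplit {

  // Hypothetical helper: splits `source` on `pattern` (a regular expression)
  // and returns one column per item, named col0, col1, ...
  def splitDynamically(df: DataFrame, source: String, pattern: String = " "): DataFrame = {
    // "temp" is a scratch column holding the array produced by split
    val withArray = df.withColumn("temp", split(col(source), pattern))

    // Extra pass over the data: the widest row decides the column count.
    // max returns null for an empty DataFrame, so fall back to 0 in that case.
    val row = withArray.agg(max(size(col("temp")))).head()
    val numCols = if (row.isNullAt(0)) 0 else row.getInt(0)

    // One select expression per array position
    val selectExprs: Seq[Column] =
      (0 until numCols).map(i => col("temp").getItem(i).as(s"col$i"))

    // : _* expands the sequence into the varargs that select expects
    withArray.select(selectExprs: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dynamic-split")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a b c d", "e f g").toDF("c0")
    splitDynamically(df, "c0").show()

    spark.stop()
  }
}
```

For the example data this yields four columns (col0 to col3), since the widest row has four items; shorter rows are padded with null in the same way as in the output above.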


1 Comment

juanchito Over a year ago

can you please explain what is "temp" and the general strategy?
