
I have a DataFrame with 4 fields, as mentioned below:

Field1 , Field2 , Field3 , Field4

I have values in the fields as below :

A1      , B1         , C1         , D1                
A2      , B2,B3      , C2,C3      , D2,D3             
A1      , B4,B5,B6   , C4,C5,C6   , D4,D5,D6          

I have to convert it into the below format :

A1      , B1         , C1         , D1          
A2      , B2         , C2         , D2            
A2      , B3         , C3         , D3      
A1      , B4         , C4         , D4      
A1      , B5         , C5         , D5      
A1      , B6         , C6         , D6      

Basically, I have to split the comma-separated values in multiple columns and form new rows based on the values, in the same order.

You can consider all of them to be of type String. Can you suggest a way to do this splitting and form new rows based on the new values?

I have already seen a similar question:

How to flatmap a nested Dataframe in Spark

But this question is different, as I have to consider splitting multiple columns at the same time, and the values should not repeat.

  • The duplicate question marked is different from this. I have multiple columns here to split at the same time. Commented Jul 14, 2016 at 22:17

2 Answers


You can convert the DataFrame to a Dataset[(String, Seq[String])] and flatMap:

import scala.util.Try
import spark.implicits._ // assumes an active SparkSession named spark

val df = Seq(
  ("A1", "B1", "C1", "D1"),
  ("A2", "B2,B3", "C2,C3", "D2,D3"),
  ("A1", "B4,B5,B6", "C4,C5,C6", "D4,D5,D6")
).toDF("x1", "x2", "x3", "x4")

// A simple sequence of expressions which allows us to flatten the results
val exprs = (0 until df.columns.size).map(i => $"value".getItem(i))

df.select($"x1", array($"x2", $"x3", $"x4")).as[(String, Seq[String])].flatMap {
  case (x1, xs) => 
    Try(xs.map(_.split(",")).transpose).map(_.map(x1 +: _)).getOrElse(Seq())
}.toDF.select(exprs:_*)

// +--------+--------+--------+--------+
// |value[0]|value[1]|value[2]|value[3]|
// +--------+--------+--------+--------+
// |      A1|      B1|      C1|      D1|
// |      A2|      B2|      C2|      D2|
// |      A2|      B3|      C3|      D3|
// |      A1|      B4|      C4|      D4|
// |      A1|      B5|      C5|      D5|
// |      A1|      B6|      C6|      D6|
// +--------+--------+--------+--------+

or use a UDF:

val splitRow = udf((xs: Seq[String]) => 
   Try(xs.map(_.split(",")).transpose).toOption)

// Same as before but we exclude the first column
val exprs = (0 until df.columns.size - 1).map(i => $"xs".getItem(i))

df
  .withColumn("xs", explode(splitRow(array($"x2", $"x3", $"x4"))))
  .select($"x1" +: exprs: _*)
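
Both variants rely on Scala's `transpose` to turn the per-column splits into per-row values. A minimal, Spark-free sketch of that step (the sample values are hypothetical):

```scala
import scala.util.Try

// One input row: a key plus comma-separated columns
val x1 = "A2"
val xs = Seq("B2,B3", "C2,C3", "D2,D3")

// Split each column, transpose to group values by position,
// then prepend the key to every resulting row
val rows = Try(xs.map(_.split(",").toSeq).transpose)
  .map(_.map(x1 +: _))
  .getOrElse(Seq())
// rows == Seq(Seq("A2", "B2", "C2", "D2"), Seq("A2", "B3", "C3", "D3"))
```

Note that `transpose` throws when the inner sequences differ in length, which is why both variants wrap it in `Try`.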

3 Comments

Great! Your solution is nice and works for the given input. But if I change the input as below: val df = Seq(("A2", "B2,B3", "C2,C3", "D2,D3")).toDF("x1", "x2", "x3", "x4") it fails and gives an error in transpose. Can you provide a general solution?
Can you provide a traceback? All split columns have to contain the same number of values for this to work.
Awesome! It works so fluently. Hope you enjoyed answering it. ;)
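
As the comments note, `transpose` throws when the split columns contain different numbers of values, and the `Try` then silently drops the whole row. One more forgiving option is to pad the shorter columns before transposing. A sketch in plain Scala (the helper name and the null padding are my own choices, not part of the answer above):

```scala
// Pad shorter columns with a placeholder so transpose always succeeds
def paddedTranspose(xs: Seq[Seq[String]], pad: String = null): Seq[Seq[String]] = {
  val n = if (xs.isEmpty) 0 else xs.map(_.size).max
  xs.map(row => row ++ Seq.fill(n - row.size)(pad)).transpose
}

val ragged = Seq(Seq("B2", "B3"), Seq("C2"), Seq("D2", "D3"))
// paddedTranspose(ragged) == Seq(Seq("B2", "C2", "D2"), Seq("B3", null, "D3"))
```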

You can use posexplode to solve this quickly. Refer to http://allabouthadoop.net/hive-lateral-view-explode-vs-posexplode/. So your code will be like below:

select
    Field1,
    f2_2 as Field2,
    f3_2 as Field3,
    f4_2 as Field4
from temp_table
lateral view posexplode(split(Field2, ',')) pe2 as f2_1, f2_2
lateral view posexplode(split(Field3, ',')) pe3 as f3_1, f3_2
lateral view posexplode(split(Field4, ',')) pe4 as f4_1, f4_2
where
    f2_1 == f3_1 and f3_1 == f4_1
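
The join-on-position idea behind this query can be illustrated in plain Scala: posexplode pairs each value with its index, and the WHERE clause keeps only the combinations whose indices match. A Spark-free sketch (the helper name and sample values are illustrative only):

```scala
// Simulate posexplode: turn a comma-separated field into (position, value) pairs
def posExplode(s: String): Seq[(Int, String)] =
  s.split(",").zipWithIndex.map { case (v, i) => (i, v) }.toSeq

val f2 = posExplode("B2,B3")
val f3 = posExplode("C2,C3")
val f4 = posExplode("D2,D3")

// Join on position, as the WHERE clause does
val rows = for {
  (p2, v2) <- f2
  (p3, v3) <- f3 if p2 == p3
  (p4, v4) <- f4 if p3 == p4
} yield ("A2", v2, v3, v4)
// rows == Seq(("A2", "B2", "C2", "D2"), ("A2", "B3", "C3", "D3"))
```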

