TL;DR
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
// `split` "String1, String2" into separate values, then create a row per value using `explode`
.withColumn("Col2", explode(split(col("_1"), ", ")))
// create a row per value in the list: "String3", "String4"
.withColumn("Col3", explode(col("_2")))
// now that we have our 4 rows, add an increasing id (consecutive here because the data sits in a single partition)
.withColumn("Col1", monotonically_increasing_id() + 1)
// only keep the columns we care about
.select("Col1", "Col2", "Col3")
.show(false)
Full Answer
Starting with your example:
val tuple2 = Tuple2("String1, String2", List("String3", "String4"))
and turning it into a DataFrame (this relies on `import spark.implicits._` being in scope for the `toDF` syntax; the spark-shell provides it automatically):
val df = List(tuple2).toDF("_1", "_2")
df.show(false)
which gives:
+----------------+------------------+
|_1 |_2 |
+----------------+------------------+
|String1, String2|[String3, String4]|
+----------------+------------------+
Now we are ready for the transformation:
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
// `split` "String1, String2" into separate values, then create a row per value using `explode`
.withColumn("Col2", explode(split(col("_1"), ", ")))
// create a row per value in the list: "String3", "String4"
.withColumn("Col3", explode(col("_2")))
// now that we have our 4 rows, add an increasing id (consecutive here because the data sits in a single partition)
.withColumn("Col1", monotonically_increasing_id() + 1)
// only keep the columns we care about
.select("Col1", "Col2", "Col3")
.show(false)
which gives:
+----+-------+-------+
|Col1|Col2 |Col3 |
+----+-------+-------+
|1 |String1|String3|
|2 |String1|String4|
|3 |String2|String3|
|4 |String2|String4|
+----+-------+-------+
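One caveat: `monotonically_increasing_id` only guarantees that the ids are increasing and unique; they are consecutive within a partition but not across partitions, so on multi-partition data `Col1` will not be a neat 1..n sequence. If you need guaranteed consecutive numbers, one option is a window-based `row_number` (sketched below; note Spark will warn about a window with no partitioning, since all rows are pulled into a single partition to be numbered):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, explode, row_number, split}

df
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  .withColumn("Col3", explode(col("_2")))
  // number the rows explicitly; ordering by the generated columns keeps the result deterministic
  .withColumn("Col1", row_number().over(Window.orderBy("Col2", "Col3")))
  .select("Col1", "Col2", "Col3")
  .show(false)
```

For small lookup-style data this is fine; for large datasets the single-partition shuffle is the price of truly consecutive numbering.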
Extra reading for further detail
It is worth noting that the order of the operations is key:
- First we explode "String1" and "String2" into their own rows:
df
.withColumn("Col2", explode(split(col("_1"), ", ")))
.select("Col2")
.show(false)
gives:
+-------+
|Col2 |
+-------+
|String1|
|String2|
+-------+
taking us from one original row to two.
- Then we explode "String3" and "String4":
df
.withColumn("Col2", explode(split(col("_1"), ", ")))
.withColumn("Col3", explode(col("_2")))
.select("Col2", "Col3")
.show(false)
gives:
+-------+-------+
|Col2 |Col3 |
+-------+-------+
|String1|String3|
|String1|String4|
|String2|String3|
|String2|String4|
+-------+-------+
- Finally we add the incrementing count. If we did this any earlier, the same number would be copied onto multiple rows. For example:
df
// here we add `Col1` to a Dataset of only one row! So we only have the value `1`
.withColumn("Col1", monotonically_increasing_id() + 1)
// here we explode row 1, copying the value of `Col1`
.withColumn("Col2", explode(split(col("_1"), ", ")))
.withColumn("Col3", explode(col("_2")))
.select("Col1", "Col2", "Col3")
.show(false)
gives:
+----+-------+-------+
|Col1|Col2 |Col3 |
+----+-------+-------+
|1 |String1|String3|
|1 |String1|String4|
|1 |String2|String3|
|1 |String2|String4|
+----+-------+-------+
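As a side note, the cartesian product produced by the two explode calls can be reproduced with plain Scala collections, where each explode plays the role of a flatMap over the values:

```scala
// each generator corresponds to one explode: the second iterates fully for every value of the first
val pairs = for {
  s1 <- "String1, String2".split(", ").toList // like explode(split(col("_1"), ", "))
  s2 <- List("String3", "String4")            // like explode(col("_2"))
} yield (s1, s2)
// pairs: List((String1,String3), (String1,String4), (String2,String3), (String2,String4))
```

This is why the row count multiplies with each explode: every existing row is expanded by every value in the exploded column.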