
I have a tuple of 2 elements like this:

Tuple2("String1, String2", ArrayList("String3", "String4"))
=> the 1st element is a string of comma-separated values
=> the 2nd element is an ArrayList of strings

I would like to get a DataFrame like this:

Col1        Col2        Col3
1           String1     String3
2           String1     String4
3           String2     String3
4           String2     String4
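
In plain Scala terms, what I want is the numbered cartesian product of the two collections; a rough sketch of the expected pairing (using List instead of ArrayList, purely for illustration):

val left  = "String1, String2".split(", ").toList
val right = List("String3", "String4")

val rows = for {
  (l, i) <- left.zipWithIndex
  (r, j) <- right.zipWithIndex
} yield (i * right.length + j + 1, l, r)
// rows: List((1,String1,String3), (2,String1,String4), (3,String2,String3), (4,String2,String4))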

1 Answer

TL;DR

import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}

df
    // `split` "String1, String2" into separate values, then create a row per value using `explode`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    // create a row per value in the list: "String3", "String4"
    .withColumn("Col3", explode(col("_2")))
    // now that we have our 4 rows, add a new column with an incrementing number
    .withColumn("Col1", monotonically_increasing_id() + 1)
    // only keep the columns we care about
    .select("Col1", "Col2", "Col3")
    .show(false)

Full Answer

Starting with your example:

val tuple2 = Tuple2("String1, String2", List("String3", "String4"))

and turning it into a DataFrame:

// `toDF` needs the SparkSession implicits in scope (assuming the session is named `spark`, as in spark-shell)
import spark.implicits._

val df = List(tuple2).toDF("_1", "_2")

df.show(false)

which gives:

+----------------+------------------+
|_1              |_2                |
+----------------+------------------+
|String1, String2|[String3, String4]|
+----------------+------------------+
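
As a quick sanity check, printSchema confirms that _1 is a plain string and _2 is an array of strings:

df.printSchema()

which prints something like:

root
 |-- _1: string (nullable = true)
 |-- _2: array (nullable = true)
 |    |-- element: string (containsNull = true)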

Now we are ready for the transformation:

import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}

df
    // `split` "String1, String2" into separate values, then create a row per value using `explode`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    // create a row per value in the list: "String3", "String4"
    .withColumn("Col3", explode(col("_2")))
    // now that we have our 4 rows, add a new column with an incrementing number
    .withColumn("Col1", monotonically_increasing_id() + 1)
    // only keep the columns we care about
    .select("Col1", "Col2", "Col3")
    .show(false)

which gives:

+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|2   |String1|String4|
|3   |String2|String3|
|4   |String2|String4|
+----+-------+-------+

Extra reading for further detail

It is worth noting that the order of the operations is key:

  1. First we explode "String1" and "String2" into their own rows:
df
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .select("Col2")
    .show(false)

gives:

+-------+
|Col2   |
+-------+
|String1|
|String2|
+-------+

where we go from one original row to two.
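
To see what explode is actually doing here, running only the split keeps both values in a single array cell:

df
    .select(split(col("_1"), ", ").as("Col2"))
    .show(false)

which gives something like:

+------------------+
|Col2              |
+------------------+
|[String1, String2]|
+------------------+

so it is the explode that turns that one array into a row per element.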

  2. Then we explode "String3" and "String4":
df
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .withColumn("Col3", explode(col("_2")))
    .select("Col2", "Col3")
    .show(false)

gives:

+-------+-------+
|Col2   |Col3   |
+-------+-------+
|String1|String3|
|String1|String4|
|String2|String3|
|String2|String4|
+-------+-------+

  3. Finally we add the incrementing count. If we did this earlier, we would copy the same number to multiple rows.

For example:

df
    // here we add `Col1` to a Dataset of only one row! So we only have the value `1`
    .withColumn("Col1", monotonically_increasing_id() + 1)
    // here we explode row 1, copying the value of `Col1`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .withColumn("Col3", explode(col("_2")))
    .select("Col1", "Col2", "Col3")
    .show(false)

gives:

+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|1   |String1|String4|
|1   |String2|String3|
|1   |String2|String4|
+----+-------+-------+
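
One caveat on monotonically_increasing_id: Spark only guarantees the generated IDs are increasing and unique, not consecutive, so on multi-partition data Col1 may not come out as 1, 2, 3, 4. If you need strictly consecutive numbers, row_number over a window is an alternative; a minimal sketch, ordering by Col2 and Col3 to reproduce the output above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, explode, row_number, split}

df
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .withColumn("Col3", explode(col("_2")))
    // row_number is always consecutive within the window; note that a window
    // with no partitionBy pulls all rows into one partition, which is fine here
    .withColumn("Col1", row_number().over(Window.orderBy("Col2", "Col3")))
    .select("Col1", "Col2", "Col3")
    .show(false)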