
I have the following dataframe, in which some columns contain arrays. (We are using Spark 1.6.)

+--------------------+--------------+------------------+--------------+--------------------+-------------+
|            UserName|     col1     |    col2          |col3          |col4                |col5         |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|foo                 |[Main, Indi...|[1777203, 1777203]|    [GBP, GBP]|            [CR, CR]|   [143, 143]|
+--------------------+--------------+------------------+--------------+--------------------+-------------+

And I expect the following result:

+--------+----------+-------+-------+-------+-------+
|UserName|    explod|explod2|explod3|explod4|explod5|
+--------+----------+-------+-------+-------+-------+
|     foo|      Main|1777203|    GBP|     CR|    143|
|     foo|Individual|1777203|    GBP|     CR|    143|
+--------+----------+-------+-------+-------+-------+

I have tried a LATERAL VIEW:

sqlContext.sql("SELECT `UserName`, explod, explod2, explod3, explod4, explod5 FROM sourceDF
LATERAL VIEW explode(`col1`) sourceDF AS explod 
LATERAL VIEW explode(`col2`) explod AS explod2 
LATERAL VIEW explode(`col3`) explod2 AS explod3 
LATERAL VIEW explode(`col4`) explod3 AS explod4 
LATERAL VIEW explode(`col5`) explod4 AS explod5")

But I get a Cartesian product with a lot of duplicates. I have tried the same thing, exploding all the columns with a withColumn approach, but I still get a lot of duplicates:

.withColumn("col1", explode($"col1"))...

Of course I can apply distinct to the final dataframe, but that is not an elegant solution. Is there any way to explode the columns without getting all these duplicates?

Thanks!

2 Comments

  • Possible duplicate of Explode (transpose?) multiple columns in Spark SQL table. Commented May 21, 2019 at 17:39
  • Hi, that question was for Spark 2.x or later, and we are using Spark 1.6, so most of the solutions provided on that question won't work. Commented May 22, 2019 at 9:19

1 Answer


If you are using Spark 2.4.0 or later, arrays_zip makes the task easier:

import org.apache.spark.sql.functions.{arrays_zip, explode}
import spark.implicits._ // for toDF and $ (assumes a SparkSession named spark)

val df = Seq(
  ("foo",
   Seq("Main", "Individual"),
   Seq(1777203, 1777203),
   Seq("GBP", "GBP"),
   Seq("CR", "CR"),
   Seq(143, 143)))
  .toDF("UserName", "col1", "col2", "col3", "col4", "col5")

df.select($"UserName",
          explode(arrays_zip($"col1", $"col2", $"col3", $"col4", $"col5")))
  .select($"UserName", $"col.*")
  .show()

Output:

+--------+----------+-------+----+----+----+
|UserName|      col1|   col2|col3|col4|col5|
+--------+----------+-------+----+----+----+
|     foo|      Main|1777203| GBP|  CR| 143|
|     foo|Individual|1777203| GBP|  CR| 143|
+--------+----------+-------+----+----+----+

1 Comment

Hi @ollik1, we are using Spark 1.6, so the arrays_zip function is not available.
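
For Spark 1.6, where arrays_zip is unavailable, one possible workaround (a sketch only, not from the answer above; the zip5 UDF, the z alias, and the element types are assumptions based on the sample data) is to zip the arrays element-wise with a UDF into a single array of structs, so that one explode produces exactly one row per index:

import org.apache.spark.sql.functions.{explode, udf}

// Sketch for Spark 1.6 (untested; assumes all five arrays have equal length
// and the element types shown in the question). Zip the arrays into one
// array of tuples, which Spark encodes as array<struct<_1,...,_5>>.
val zip5 = udf((a: Seq[String], b: Seq[Int], c: Seq[String], d: Seq[String], e: Seq[Int]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i), e(i))))

sourceDF
  .select($"UserName",
          explode(zip5($"col1", $"col2", $"col3", $"col4", $"col5")).as("z"))
  .select($"UserName",
          $"z._1".as("explod"),  $"z._2".as("explod2"), $"z._3".as("explod3"),
          $"z._4".as("explod4"), $"z._5".as("explod5"))

Because the five arrays collapse into a single generator, there is only one explode, so no Cartesian product arises and no distinct is needed.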
