
I have the following dataframe, in which some columns contain arrays. (We are using Spark 1.6.)

+--------------------+--------------+------------------+--------------+--------------------+-------------+
|            UserName|     col1     |    col2          |col3          |col4                |col5         |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|foo                 |[Main, Indi...|[1777203, 1777203]|    [GBP, GBP]|            [CR, CR]|   [143, 143]|
+--------------------+--------------+------------------+--------------+--------------------+-------------+

And I expect the following result:

+--------+----------+-------+-------+-------+-------+
|UserName|    explod|explod2|explod3|explod4|explod5|
+--------+----------+-------+-------+-------+-------+
|     foo|      Main|1777203|    GBP|     CR|    143|
|     foo|Individual|1777203|    GBP|     CR|    143|
+--------+----------+-------+-------+-------+-------+

I have tried a LATERAL VIEW:

sqlContext.sql("SELECT `UserName`, explod, explod2, explod3, explod4, explod5 FROM sourceDF
LATERAL VIEW explode(`col1`) sourceDF AS explod 
LATERAL VIEW explode(`col2`) explod AS explod2 
LATERAL VIEW explode(`col3`) explod2 AS explod3 
LATERAL VIEW explode(`col4`) explod3 AS explod4 
LATERAL VIEW explode(`col5`) explod4 AS explod5")

But I get a Cartesian product with a lot of duplicates. I have tried the same thing, exploding all the columns with a withColumn approach, but I still get a lot of duplicates:

.withColumn("col1", explode($"col1"))...

Of course I can apply distinct to the final dataframe, but that is not an elegant solution. Is there any way to explode the columns without getting all these duplicates?

Thanks!

2 Comments

  • Possible duplicate of Explode (transpose?) multiple columns in Spark SQL table. Commented May 21, 2019 at 17:39
  • Hi, that question was for Spark 2.x or later, and we are using Spark 1.6, so most of the solutions provided on that question won't work. Commented May 22, 2019 at 9:19

1 Answer


If you are using Spark 2.4.0 or later, arrays_zip makes the task easier:

import org.apache.spark.sql.functions.{arrays_zip, explode}
import spark.implicits._ // for toDF and $ (assumes a SparkSession named spark)

val df = Seq(
  ("foo",
   Seq("Main", "Individual"),
   Seq(1777203, 1777203),
   Seq("GBP", "GBP"),
   Seq("CR", "CR"),
   Seq(143, 143)))
  .toDF("UserName", "col1", "col2", "col3", "col4", "col5")

df.select($"UserName",
          explode(arrays_zip($"col1", $"col2", $"col3", $"col4", $"col5")))
  .select($"UserName", $"col.*")
  .show()

Output:

+--------+----------+-------+----+----+----+
|UserName|      col1|   col2|col3|col4|col5|
+--------+----------+-------+----+----+----+
|     foo|      Main|1777203| GBP|  CR| 143|
|     foo|Individual|1777203| GBP|  CR| 143|
+--------+----------+-------+----+----+----+

1 Comment

Hi @ollik1, we are using Spark 1.6, so the arrays_zip function is not available.
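
For Spark 1.6, where arrays_zip is unavailable, one possible workaround (a sketch only, not from the answer above; the zip5 UDF, the z alias, and the element types are assumptions based on the sample data) is to zip the arrays element-wise with a UDF into a single array of structs, so that one explode produces exactly one row per index:

import org.apache.spark.sql.functions.{explode, udf}

// Sketch for Spark 1.6 (untested; assumes all five arrays have equal length
// and the element types shown in the question). Zip the arrays into one
// array of tuples, which Spark encodes as array<struct<_1,...,_5>>.
val zip5 = udf((a: Seq[String], b: Seq[Int], c: Seq[String], d: Seq[String], e: Seq[Int]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i), e(i))))

sourceDF
  .select($"UserName",
          explode(zip5($"col1", $"col2", $"col3", $"col4", $"col5")).as("z"))
  .select($"UserName",
          $"z._1".as("explod"),  $"z._2".as("explod2"), $"z._3".as("explod3"),
          $"z._4".as("explod4"), $"z._5".as("explod5"))

Because the five arrays collapse into a single generator, there is only one explode, so no Cartesian product arises and no distinct is needed.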
