Lets say i loaded a json file into Spark 1.6 via
sqlContext.read.json("/hdfs/")
it gives me a Dataframe with following schema:
root
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
|-- checked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- color: array (nullable = true)
| |-- element: string (containsNull = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
The DF has only one row with an Array of all my Items inside.
+--------------------+--------------------+--------------------+
| id_e| checked_e| color_e|
+--------------------+--------------------+--------------------+
|[0218797c-77a6-45...|[false, true, tru...|[null, null, null...|
+--------------------+--------------------+--------------------+
I want to have a DF with the arrays exploded into one item per line.
+--------------------+-----+-------+
| id|color|checked|
+--------------------+-----+-------+
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
...
So far i achieved this by creating a temporary table from the array DF and used sql with lateral view explode on these lines.
val results = sqlContext.sql("
SELECT id, color, checked from temptable
lateral view explode(checked_e) temptable as checked
lateral view explode(id_e) temptable as id
lateral view explode(color_e) temptable as color
")
Is there any way to achieve this directly with Dataframe functions without using SQL? I know there is something like df.explode(...) but i could not get it to work with my Data
EDIT: It seems the explode isnt what i really wanted in the first place. I want a new dataframe that has each item of the arrays line by line. The explode function actually gives back way more lines than my initial dataset has.