I am currently using Apache Spark 2.1.1 to process an XML file into a CSV. My goal is to flatten the XML, but the problem I am facing is elements with unbounded occurrences: Spark automatically infers these as arrays. What I want to do now is explode such an array column.
Sample Schema
|-- Instrument_XREF_Identifier: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- @bsid: string (nullable = true)
| | |-- @exch_code: string (nullable = true)
| | |-- @id_bb_sec_num: string (nullable = true)
| | |-- @market_sector: string (nullable = true)
I know I can explode the array with

result = result.withColumn(p.name, explode(col(p.name)))

which produces one row per array element, each holding a struct. But the output I want is to explode the array into multiple columns instead of multiple rows.
Here is my expected output according to the schema I mentioned above:
Let's say there are two struct values in the array.
bsid1  exch_code1  id_bb_sec_num1  market_sector1  bsid2  exch_code2  id_bb_sec_num2  market_sector2
123    3           1               13              234    12          212             221