select specific columns in Spark DataFrames from Array of Struct

Question

I have a Spark DataFrame df with the following Schema:

root
 |-- k: integer (nullable = false)
 |-- v: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: integer (nullable = false)
 |    |    |-- b: double (nullable = false)
 |    |    |-- c: string (nullable = true)

Is it possible to select just a and b in v from df without doing a map? In particular, df is loaded from a Parquet file and I don't want the values for c to even be loaded/read.

Roberto Congiu · Accepted Answer · 2016-05-12 22:21:57Z

It depends on exactly what output you expect, which is not clear from your question. Let me clarify. You can do

df.select($"v.a",$"v.b").show()

However, the result may not be what you want: since v is an array, this yields one array of a values and one array of b values per row. What you may want instead is to explode the array v and then select from the exploded DataFrame:

import org.apache.spark.sql.functions.explode
df.select(explode($"v").as("v")).select($"v.a", $"v.b").show()

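This flattens v into a table with one row per array element. In either case, Spark/Parquet should be able to skip c entirely. Strictly speaking, this is column pruning (projection pushdown) rather than predicate pushdown, which applies to filters. Whether fields nested inside an array are actually pruned depends on the Spark version, so it is worth checking the physical plan with explain().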
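To make the difference between the two selects concrete, here is a minimal sketch that can be run locally. It assumes Spark 2.x or later (it uses SparkSession), and the case class Elem, the object name, the helper names, and the sample rows are all made up for illustration; only the column names k, v, a, b, c come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case class mirroring the element struct of v in the question.
case class Elem(a: Int, b: Double, c: String)

object SelectFromArrayOfStruct {

  // Made-up sample rows with the same shape as the schema in the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, Seq(Elem(10, 1.5, "x"), Elem(20, 2.5, "y"))),
      (2, Seq(Elem(30, 3.5, "z")))
    ).toDF("k", "v")
  }

  // Option 1: keeps one row per input row; a and b each come back as arrays.
  def asArrays(df: DataFrame): DataFrame =
    df.select(col("v.a"), col("v.b"))

  // Option 2: one row per array element; a and b come back as scalar columns.
  def exploded(df: DataFrame): DataFrame =
    df.select(explode(col("v")).as("v")).select(col("v.a"), col("v.b"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("select-from-array-of-struct")
      .master("local[*]")
      .getOrCreate()

    val df = sampleData(spark)

    val arrays = asArrays(df)
    arrays.printSchema()
    arrays.show()

    val flat = exploded(df)
    flat.printSchema()
    flat.show()

    // Prints the physical plan. This sample is in-memory, so the plan has no
    // file scan; run the same call on a Parquet-backed DataFrame to see
    // which columns the scan actually reads.
    flat.explain()

    spark.stop()
  }
}
```

With the sample data above, the first select returns two rows (one per input row) and the exploded select returns three (one per array element). To check whether c is really skipped, run explain() on the DataFrame loaded from the Parquet file and look at the schema the scan node reports reading.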
