root
 |
 |-- dogs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- sources: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |-- _2: age (nullable = true)

Calling data.select("dogs").show(2,False) displays the following:

+---------------------------------------------------------------------------------+
|dogs                                                                             |
+---------------------------------------------------------------------------------+
|[[[Max,White,WrappedArray(SanDiego)],3], [[Spot,Black,WrappedArray(SanDiego)],2]]|
|[[[Michael,Black,WrappedArray(SanJose)],1]]                                      |
+---------------------------------------------------------------------------------+
only showing top 2 rows

Is it possible to access the array elements in each cell? For example, I want to retrieve (Max, White), (Spot, Black) and (Michael, Black) from the dogs column.

In addition, I would like to expand each row containing n elements into n rows, if possible.

Thanks!

  • Possible duplicate of Querying Spark SQL DataFrame with complex types Commented Apr 25, 2016 at 19:06
  • It is the same question in scala-spark, though Edamame seems to be working with pyspark code. Not sure how SO should organize these (especially given their similarity), but the pyspark equivalent answer is below. Commented Apr 25, 2016 at 19:24
  • Can you post a sample data set? Commented Apr 25, 2016 at 20:09

1 Answer

You can use explode, as shown below, to get a DataFrame in which each row is one element of the array.

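# Register the DataFrame as a temporary table (createOrReplaceTempView in Spark 2.0+).
data.registerTempTable("data")
# explode() emits one output row per element of the dogs array.
dataExplode = sqlContext.sql("select explode(dogs) as dog from data")
dataExplode.show()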

Then you can use select on the exploded DataFrame to keep just the fields you are interested in, such as dog._1.name and dog._1.color.

2 Comments

@Edamame sorry, there was a typo in the code. I forgot the quotes around "data" in registerTempTable. I edited the code, hopefully it works for you now
explode() is useful when working with nested datatypes in dataframes.
