
I have a DataFrame with one column. Each row of that column holds an array of string values:

Values in my Spark 2.2 DataFrame:

["123", "abc", "2017", "ABC"]
["456", "def", "2001", "ABC"]
["789", "ghi", "2017", "DEF"]

org.apache.spark.sql.DataFrame = [col: array]

root
|-- col: array (nullable = true)
|    |-- element: string (containsNull = true)

What is the best way to access elements in the array? For example, I would like to extract the distinct values of the fourth element for the year 2017 (answer: "ABC", "DEF").

4 Answers


Since Spark 2.4.0, there is a new function element_at($array_column, $index).

See Spark docs
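A minimal sketch applying it to the question's example (assuming the question's DataFrame is bound to df with the column named col, as above; the $ syntax assumes spark.implicits._, which is preloaded in spark-shell):

import org.apache.spark.sql.functions.element_at

// element_at is 1-based: element 3 is the year, element 4 the code
df.where(element_at($"col", 3) === "2017")
  .select(element_at($"col", 4).as("code"))
  .distinct()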


4 Comments

This is very helpful and saved a lot of time. I was planning to write a UDF to do this, which kept erroring out. This answer is great.
I wish this existed in Spark 2.3.0. Is there an equivalent?
@AlexMoore-Niemi: You can use getItem, see the other answers here.
note: indexing starts at 1
import org.apache.spark.sql.functions.lit

// getItem is 0-based: index 2 holds the year, index 3 the code
df.where($"col".getItem(2) === lit("2017")).select($"col".getItem(3))

See getItem in the Column scaladoc: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

2 Comments

It is interesting to note that getItem simply returns NULL if the index is out of range rather than throwing an exception.
And contrary to element_at, indexing starts at 0.

What is the best way to access elements in the array?

You access elements in an array column with the getItem operator.

getItem(key: Any): Column An expression that gets an item at position ordinal out of an array, or gets a value by key key in a MapType.

You could also use the apply syntax, (ordinal), to access an element at a given position.

val ds = Seq(
  Array("123", "abc", "2017", "ABC"),
  Array("456", "def", "2001", "ABC"),
  Array("789", "ghi", "2017", "DEF")).toDF("col")
scala> ds.printSchema
root
 |-- col: array (nullable = true)
 |    |-- element: string (containsNull = true)
scala> ds.select($"col"(2)).show
+------+
|col[2]|
+------+
|  2017|
|  2001|
|  2017|
+------+

Which approach suits you better, getItem or simply (ordinal), is just a matter of personal choice and taste.

And in your case, where / filter followed by select with distinct gives the proper answer (as @Will did); see the sketch below.
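Putting that together for the question's example (reusing the ds defined above; remember that (ordinal), like getItem, is 0-based):

// filter on the year (index 2), project the code (index 3), dedupe
ds.where($"col"(2) === "2017")
  .select($"col"(3).as("code"))
  .distinct()
  .show()

// +----+
// |code|
// +----+
// | ABC|
// | DEF|
// +----+
// (row order may vary)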



You can do something like the following:

import org.apache.spark.sql.functions._
// in spark-shell; in an application, also import spark.implicits._
// for toDF and the 'symbol column syntax

val ds = Seq(
  Array("123", "abc", "2017", "ABC"),
  Array("456", "def", "2001", "ABC"),
  Array("789", "ghi", "2017", "DEF")).toDF("col")

// element_at (Spark 2.4+) is 1-based
ds.withColumn("col1", element_at('col, 1))
  .withColumn("col2", element_at('col, 2))
  .withColumn("col3", element_at('col, 3))
  .withColumn("col4", element_at('col, 4))
  .drop('col)
  .show()

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 123| abc|2017| ABC|
| 456| def|2001| ABC|
| 789| ghi|2017| DEF|
+----+----+----+----+
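If you bind the expanded DataFrame to a value (a hypothetical name, expanded, not in the original snippet), the question's query then reduces to an ordinary filter plus distinct:

// hypothetical binding of the expanded frame above
val expanded = ds
  .withColumn("col3", element_at('col, 3))
  .withColumn("col4", element_at('col, 4))

expanded.where($"col3" === "2017").select($"col4").distinct().show()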

