
I can filter a Spark DataFrame (in PySpark) on whether a particular value exists in an array column by doing the following:

from pyspark.sql.functions import array_contains
spark_df.filter(array_contains(spark_df.array_column_name, "value that I want")) 

But is there a way to get the index of where in the array the item was found?


3 Answers


In Spark 2.4+, there's the array_position function:

df = spark.createDataFrame([(["c", "b", "a"],), ([],)], ['data'])
df.show()
#+---------+
#|     data|
#+---------+
#|[c, b, a]|
#|       []|
#+---------+

from pyspark.sql.functions import array_position
df.select(df.data, array_position(df.data, "a").alias('a_pos')).show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[c, b, a]|    3|
#|       []|    0|
#+---------+-----+

Notes from the docs:

  1. Locates the position of only the first occurrence of the given value in the given array;

  2. The position is not zero-based but 1-based. Returns 0 if the given value could not be found in the array.
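
As a quick illustration of both notes, a minimal sketch (df2 here is illustrative data of my own, not from the docs): only the first occurrence is located, and because a miss returns 0 rather than null, array_position(...) > 0 can double as an existence filter, much like array_contains:

from pyspark.sql.functions import array_position

df2 = spark.createDataFrame([(["a", "b", "a"],), (["b"],)], ['data'])

# Only the first "a" is located (position 1, not 3), and the row without
# "a" yields 0, so filtering on a_pos > 0 keeps only rows containing "a".
df2.select('data', array_position('data', 'a').alias('a_pos')) \
   .filter('a_pos > 0').show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[a, b, a]|    1|
#+---------+-----+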


3 Comments

So this feature exists only with version 2.4?
@vikrantrana Yes. It's new in version 2.4
Does this function exist for Scala with Spark?

I am using Spark 2.3, so I tried this with a UDF.

df = spark.createDataFrame([(["c", "b", "a", "e", "f"],)], ['arraydata'])
df.show()
+---------------+
|      arraydata|
+---------------+
|[c, b, a, e, f]|
+---------------+

from pyspark.sql.functions import udf, lit

# Returns the (zero-based) indices of all occurrences of y in x.
user_func = udf(lambda x, y: [i for i, e in enumerate(x) if e == y])

Checking the index position of item 'b':

newdf = df.withColumn('item_position',user_func(df.arraydata,lit('b')))

>>> newdf.show()
+---------------+-------------+
|      arraydata|item_position|
+---------------+-------------+
|[c, b, a, e, f]|          [1]|
+---------------+-------------+

Checking the index position of item 'e':

newdf = df.withColumn('item_position',user_func(df.arraydata,lit('e')))

>>> newdf.show()
+---------------+-------------+
|      arraydata|item_position|
+---------------+-------------+
|[c, b, a, e, f]|          [3]|
+---------------+-------------+
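
One caveat: since the udf above declares no return type, Spark defaults to StringType, so item_position is actually the string representation of the index list, and the indices are zero-based. A sketch of the same idea with an explicit array return type (user_func_typed is my own naming):

from pyspark.sql.functions import udf, lit
from pyspark.sql.types import ArrayType, IntegerType

# Same logic, but declaring the return type yields a real array column
# of zero-based indices instead of its string representation.
user_func_typed = udf(lambda x, y: [i for i, e in enumerate(x) if e == y],
                      ArrayType(IntegerType()))

newdf = df.withColumn('item_position', user_func_typed(df.arraydata, lit('b')))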



This may be used to find more than one position:

F.array_compact(F.transform('data', lambda x, i: F.when(x == 'b', i)))

The result is zero-based.


Full example (Spark 3.4+):

from pyspark.sql import functions as F
df = spark.createDataFrame([(['a', 'b', 'b'],), (['c'],)], ['data'])
df.show()
# +---------+
# |     data|
# +---------+
# |[a, b, b]|
# |      [c]|
# +---------+

df = df.withColumn(
    'positions_b',
    F.array_compact(F.transform('data', lambda x, i: F.when(x == 'b', i)))
)
df.show()
# +---------+-----------+
# |     data|positions_b|
# +---------+-----------+
# |[a, b, b]|     [1, 2]|
# |      [c]|         []|
# +---------+-----------+
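
If only the first (or last) match is needed, F.array_min or F.array_max can pull it out of the positions array. A sketch built on the frame above (first_b is my own column name); array_min returns null for an empty array, so the no-match case needs no special handling:

# First zero-based match per row, or null when there is none.
df = df.withColumn('first_b', F.array_min('positions_b'))
df.show()
# +---------+-----------+-------+
# |     data|positions_b|first_b|
# +---------+-----------+-------+
# |[a, b, b]|     [1, 2]|      1|
# |      [c]|         []|   NULL|
# +---------+-----------+-------+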

