
I can filter a Spark DataFrame (in PySpark) on whether a particular value exists in an array column by doing the following:

from pyspark.sql.functions import array_contains
spark_df.filter(array_contains(spark_df.array_column_name, "value that I want")) 

But is there a way to get the index of where in the array the item was found?


3 Answers


In Spark 2.4+, there's the array_position function:

df = spark.createDataFrame([(["c", "b", "a"],), ([],)], ['data'])
df.show()
#+---------+
#|     data|
#+---------+
#|[c, b, a]|
#|       []|
#+---------+

from pyspark.sql.functions import array_position
df.select(df.data, array_position(df.data, "a").alias('a_pos')).show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[c, b, a]|    3|
#|       []|    0|
#+---------+-----+

Notes from the docs:

  1. Locates the position of only the first occurrence of the given value in the given array;

  2. The position is not zero-based but 1-based. Returns 0 if the given value could not be found in the array.
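
As a quick illustration of both notes, a minimal sketch (df2 here is illustrative data of my own, not from the docs): only the first occurrence is located, and because a miss returns 0 rather than null, array_position(...) > 0 can double as an existence filter, much like array_contains:

from pyspark.sql.functions import array_position

df2 = spark.createDataFrame([(["a", "b", "a"],), (["b"],)], ['data'])

# Only the first "a" is located (position 1, not 3), and the row without
# "a" yields 0, so filtering on a_pos > 0 keeps only rows containing "a".
df2.select('data', array_position('data', 'a').alias('a_pos')) \
   .filter('a_pos > 0').show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[a, b, a]|    1|
#+---------+-----+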


3 Comments

So this feature exists only with version 2.4?
@vikrantrana Yes. It's new in version 2.4
Does this function exist for Scala with Spark?

I am using Spark 2.3, so I tried this with a UDF.

df = spark.createDataFrame([(["c", "b", "a", "e", "f"],)], ['arraydata'])
df.show()
+---------------+
|      arraydata|
+---------------+
|[c, b, a, e, f]|
+---------------+

from pyspark.sql.functions import udf, lit

# Returns the (zero-based) indices of all occurrences of y in x.
user_func = udf(lambda x, y: [i for i, e in enumerate(x) if e == y])

Checking the index position of item 'b':

newdf = df.withColumn('item_position',user_func(df.arraydata,lit('b')))

>>> newdf.show()
+---------------+-------------+
|      arraydata|item_position|
+---------------+-------------+
|[c, b, a, e, f]|          [1]|
+---------------+-------------+

Checking the index position of item 'e':

newdf = df.withColumn('item_position',user_func(df.arraydata,lit('e')))

>>> newdf.show()
+---------------+-------------+
|      arraydata|item_position|
+---------------+-------------+
|[c, b, a, e, f]|          [3]|
+---------------+-------------+
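
One caveat: since the udf above declares no return type, Spark defaults to StringType, so item_position is actually the string representation of the index list, and the indices are zero-based. A sketch of the same idea with an explicit array return type (user_func_typed is my own naming):

from pyspark.sql.functions import udf, lit
from pyspark.sql.types import ArrayType, IntegerType

# Same logic, but declaring the return type yields a real array column
# of zero-based indices instead of its string representation.
user_func_typed = udf(lambda x, y: [i for i, e in enumerate(x) if e == y],
                      ArrayType(IntegerType()))

newdf = df.withColumn('item_position', user_func_typed(df.arraydata, lit('b')))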



This may be used to find more than one position:

F.array_compact(F.transform('data', lambda x, i: F.when(x == 'b', i)))

The result is zero-based.


Full example (Spark 3.4+):

from pyspark.sql import functions as F
df = spark.createDataFrame([(['a', 'b', 'b'],), (['c'],)], ['data'])
df.show()
# +---------+
# |     data|
# +---------+
# |[a, b, b]|
# |      [c]|
# +---------+

df = df.withColumn(
    'positions_b',
    F.array_compact(F.transform('data', lambda x, i: F.when(x == 'b', i)))
)
df.show()
# +---------+-----------+
# |     data|positions_b|
# +---------+-----------+
# |[a, b, b]|     [1, 2]|
# |      [c]|         []|
# +---------+-----------+
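
If only the first (or last) match is needed, F.array_min or F.array_max can pull it out of the positions array. A sketch built on the frame above (first_b is my own column name); array_min returns null for an empty array, so the no-match case needs no special handling:

# First zero-based match per row, or null when there is none.
df = df.withColumn('first_b', F.array_min('positions_b'))
df.show()
# +---------+-----------+-------+
# |     data|positions_b|first_b|
# +---------+-----------+-------+
# |[a, b, b]|     [1, 2]|      1|
# |      [c]|         []|   NULL|
# +---------+-----------+-------+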

