Pyspark: Get index of array element based on substring

Question

I have the following dataframe, which contains a column of arrays (col1). I need to get the index of the element that contains a certain substring ("58=").

+-----------------------------------------------------------+-----+
|                                                      col1 |a_pos|
+-----------------------------------------------------------+-----+
|[8=FIX.4.4, 55=ITUBD264, 58=AID[43e39b2e-c6e2-4947]        |    0|
+-----------------------------------------------------------+-----+

I've tried to use array_position(col1, "58="), but it seems to work only with exact matches, not substrings.

In Python I'm doing exactly this in pandas with the following code:

df['idx'] = [max(range(len(l)), key=lambda x: '58=' in l[x]) for l in df['col1']]
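
One caveat with the one-liner above: when no element contains the substring, every key is False, so max returns index 0, which looks the same as a match at position 0. A minimal sketch of a more explicit variant follows; the helper name first_index_containing and the -1 sentinel are hypothetical choices for illustration, not part of the original code.

```python
def first_index_containing(items, needle):
    """Return the 0-based index of the first element containing needle, or -1."""
    return next((i for i, item in enumerate(items) if needle in item), -1)

row = ["8=FIX.4.4", "55=ITUBD264", "58=AID[43e39b2e-c6e2-4947]"]

print(first_index_containing(row, "58="))  # 2
print(first_index_containing(row, "99="))  # -1

# The max(...) form gives 0 when nothing matches.
print(max(range(len(row)), key=lambda x: "99=" in row[x]))  # 0
```

In pandas this could be applied as df['idx'] = [first_index_containing(l, '58=') for l in df['col1']].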

wwnde · Accepted Answer · 2022-05-09 23:33:19Z

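Check whether each element matches "58=" using rlike inside the higher-order function transform, then find the position of the first true value with array_position. Note that array_position is 1-based and returns 0 when nothing matches. Also, show() returns None, so call it separately instead of assigning its result to df. Code below:

from pyspark.sql.functions import expr
df = df.withColumn('index', expr("array_position(transform(col1, x -> rlike(x, '58=')), true)"))
df.show(truncate=False)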

+---------------------------------------------------+-----+-----+
|col1                                               |a_pos|index|
+---------------------------------------------------+-----+-----+
|[8=FIX.4.4, 55=ITUBD264, 58=AID[43e39b2e-c6e2-4947]|0    |3    |
+---------------------------------------------------+-----+-----+
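
To make the position semantics concrete, here is a plain-Python model of the expression. It is only a sketch, not Spark code: the helper name array_position_rlike is hypothetical, Python's re module stands in for Spark's Java regular expressions, and null handling is ignored.

```python
import re

def array_position_rlike(items, pattern):
    """Model array_position(transform(items, x -> rlike(x, pattern)), true)."""
    # transform + rlike: one boolean per element, True if the regex matches anywhere
    flags = [re.search(pattern, item) is not None for item in items]
    # array_position: 1-based position of the first True, 0 if there is none
    return flags.index(True) + 1 if True in flags else 0

row = ["8=FIX.4.4", "55=ITUBD264", "58=AID[43e39b2e-c6e2-4947]"]

spark_style = array_position_rlike(row, "58=")
print(spark_style)                        # 3, as in the index column
print(spark_style - 1)                    # 2, the 0-based index pandas would give
print(array_position_rlike(row, "99="))   # 0, meaning "not found"
```

Because the pattern is a regular expression, a substring containing characters such as [ or . would need escaping before being passed to rlike.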

edited May 9, 2022 at 23:33

answered May 9, 2022 at 22:52
