
I have this dataframe:

+------+-----------+--------------------+
|NewsId|    expNews|            transArr|
+------+-----------+--------------------+
|     1|      House| [house, HH, AW1, S]|
|     1|Republicans|[republicans, R, ...|
|     1|       Fret|[fret, F, R, EH1, T]|
|     1|      About|[about, AH0, B, A...|
+------+-----------+--------------------+

I want to remove the element at index 0 from each array in column transArr. Expected result:

+------+-----------+--------------+
|NewsId|    expNews|      transArr|
+------+-----------+--------------+
|     1|      House|  [HH, AW1, S]|
|     1|Republicans|       [R, ...|
|     1|       Fret|[F, R, EH1, T]|
|     1|      About| [AH0, B, A...|
+------+-----------+--------------+

Is there an easy way to do this with Spark and Scala?

4 Answers


Check the code below; it is faster than using the slice function:

scala> df.show(false)
+------+-----------+---------------------+
|NewsId|expNews    |transArr             |
+------+-----------+---------------------+
|1     |House      |[house, HH, AW1, S]  |
|1     |Republicans|[republicans, R, ...]|
|1     |Fret       |[fret, F, R, EH1, T] |
|1     |About      |[about, AH0, B, A...]|
+------+-----------+---------------------+
scala> df
.withColumn(
    "modified_transArr",
    array_except(
        $"transArr",
        array($"transArr"(0))
    )
).show(false)
+------+-----------+---------------------+-----------------+
|NewsId|expNews    |transArr             |modified_transArr|
+------+-----------+---------------------+-----------------+
|1     |House      |[house, HH, AW1, S]  |[HH, AW1, S]     |
|1     |Republicans|[republicans, R, ...]|[R, ...]         |
|1     |Fret       |[fret, F, R, EH1, T] |[F, R, EH1, T]   |
|1     |About      |[about, AH0, B, A...]|[AH0, B, A...]   |
+------+-----------+---------------------+-----------------+
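For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch of the same approach; the sample rows below are made up, since the original arrays are truncated in the post. One caveat: array_except returns the elements of the first array that are not in the second, without duplicates, so it would also drop any later repeats of the removed word. That is harmless here, where the first element is the word and the rest are phonemes.

import org.apache.spark.sql.functions._
import spark.implicits._

// Made-up sample data standing in for the truncated rows in the question
val df = Seq(
  (1, "House", Seq("house", "HH", "AW1", "S")),
  (1, "Fret",  Seq("fret", "F", "R", "EH1", "T"))
).toDF("NewsId", "expNews", "transArr")

// Subtract a one-element array holding the first element of each row's array
df.withColumn("transArr", array_except($"transArr", array($"transArr"(0))))
  .show(false)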

1 Comment

This should be the accepted answer. I was struggling with $"transArr"(0). Damn

For Spark 3.0+ you can use the filter function with an additional index argument:

df.withColumn("transArr", expr("filter(transArr, (x,i) -> i>0)"))



A Spark 2.4+ solution:

df.withColumn("transArr", array_except($"transArr", slice($"transArr",1,1)))

slice(arr, start, len) returns the first element as a one-element array, and array_except then subtracts it from the original array.
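If the de-duplicating behaviour of array_except is a concern (it removes duplicates from the result), here is a sketch of the same idea using only slice inside a SQL expression, still Spark 2.4+ and assuming the arrays are non-empty as in the question:

import org.apache.spark.sql.functions.expr

// slice is 1-based, so start at position 2 and take the remaining elements
df.withColumn("transArr", expr("slice(transArr, 2, size(transArr) - 1)"))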



A solution using a Spark UDF: take the tail of the array. Make sure you handle null values in the column.

val parse_udf = udf((value: Seq[String]) => value.tail)
df.withColumn("transArr", parse_udf($"transArr")).show()

