
I have this dataframe:

+------+-----------+--------------------+
|NewsId|    expNews|            transArr|
+------+-----------+--------------------+
|     1|      House| [house, HH, AW1, S]|
|     1|Republicans|[republicans, R, ...|
|     1|       Fret|[fret, F, R, EH1, T]|
|     1|      About|[about, AH0, B, A...|
+------+-----------+--------------------+

I want to remove the element at index 0 from each array in column transArr. Expected result:

+------+-----------+--------------+
|NewsId|    expNews|      transArr|
+------+-----------+--------------+
|     1|      House|  [HH, AW1, S]|
|     1|Republicans|       [R, ...|
|     1|       Fret|[F, R, EH1, T]|
|     1|      About| [AH0, B, A...|
+------+-----------+--------------+

Is there an easy way to do this with Spark and Scala?

4 Answers


Check the code below; it is faster than using the slice function:

scala> df.show(false)
+------+-----------+---------------------+
|NewsId|expNews    |transArr             |
+------+-----------+---------------------+
|1     |House      |[house, HH, AW1, S]  |
|1     |Republicans|[republicans, R, ...]|
|1     |Fret       |[fret, F, R, EH1, T] |
|1     |About      |[about, AH0, B, A...]|
+------+-----------+---------------------+
scala> df
.withColumn(
    "modified_transArr",
    array_except(
        $"transArr",
        array($"transArr"(0))
    )
).show(false)
+------+-----------+---------------------+-----------------+
|NewsId|expNews    |transArr             |modified_transArr|
+------+-----------+---------------------+-----------------+
|1     |House      |[house, HH, AW1, S]  |[HH, AW1, S]     |
|1     |Republicans|[republicans, R, ...]|[R, ...]         |
|1     |Fret       |[fret, F, R, EH1, T] |[F, R, EH1, T]   |
|1     |About      |[about, AH0, B, A...]|[AH0, B, A...]   |
+------+-----------+---------------------+-----------------+
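For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch of the same approach; the sample rows below are made up, since the original arrays are truncated in the post. One caveat: array_except returns the elements of the first array that are not in the second, without duplicates, so it would also drop any later repeats of the removed word. That is harmless here, where the first element is the word and the rest are phonemes.

import org.apache.spark.sql.functions._
import spark.implicits._

// Made-up sample data standing in for the truncated rows in the question
val df = Seq(
  (1, "House", Seq("house", "HH", "AW1", "S")),
  (1, "Fret",  Seq("fret", "F", "R", "EH1", "T"))
).toDF("NewsId", "expNews", "transArr")

// Subtract a one-element array holding the first element of each row's array
df.withColumn("transArr", array_except($"transArr", array($"transArr"(0))))
  .show(false)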

1 Comment

This should be the accepted answer. I was struggling with $"transArr"(0). Damn

For Spark 3.0+ you can use the filter function with an additional index argument:

df.withColumn("transArr", expr("filter(transArr, (x,i) -> i>0)"))



A Spark 2.4+ solution:

df.withColumn("transArr", array_except($"transArr", slice($"transArr",1,1)))

slice(arr, start, len) returns the first element as a one-element array, and array_except then subtracts it from the original array.
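If the de-duplicating behaviour of array_except is a concern (it removes duplicates from the result), here is a sketch of the same idea using only slice inside a SQL expression, still Spark 2.4+ and assuming the arrays are non-empty as in the question:

import org.apache.spark.sql.functions.expr

// slice is 1-based, so start at position 2 and take the remaining elements
df.withColumn("transArr", expr("slice(transArr, 2, size(transArr) - 1)"))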



A solution using a Spark UDF: take the tail of the array. Make sure you handle null values in the column.

val parse_udf = udf((value: Seq[String]) => value.tail)
df.withColumn("transArr", parse_udf($"transArr")).show()

