pyspark dataframe: remove duplicates in an array column

Question

I would like to remove duplicated words in a column of a PySpark DataFrame.

This is based on the question "Remove duplicates from PySpark array column".

My Spark version:

  2.4.5

Python 3 code:

  import pyspark.sql.functions as F

  test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
  # Wrap the string in an array because the original large df has an array-type column.
  t3 = test_df.withColumn("text", F.array("text"))

  t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
  t5 = t4.withColumn('text', F.array_distinct("text"))
  t5.show(1, 120)

but got

 +--------------------------------------------------------+
 |                                                    text| 
 +--------------------------------------------------------+
 |[i like this book and this book be downloaded on line]|
 +--------------------------------------------------------+

I need to remove

 book and this

It seems that "array_distinct" cannot filter them out?

Thanks.

Do have a look at the given link. It might be helpful: stackoverflow.com/questions/47316783/…
– Muhammad Hamza Sabir, Commented Sep 15, 2020 at 5:11
"and" is not duplicated anywhere in the string, so on what basis do you want to remove it? Or do you mean "book and this"? Can you show your desired final result?
– kfkhalili, Commented Sep 15, 2020 at 7:29
It won't filter out anything because the array holds a single string rather than multiple strings, so array_distinct finds only one element in it. I assume you need to remove duplicate words from the string and not from the array of strings. Is this correct?
– Frosty, Commented Sep 15, 2020 at 8:58
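
To illustrate the point made in the last comment, here is a minimal plain-Python sketch (not Spark code). The helper name distinct is hypothetical and only stands in for array_distinct, assuming the first occurrence of each element is kept. It shows that a one-element array has nothing to deduplicate, while an array of words does.

```python
# Plain-Python stand-in for array_distinct (hypothetical helper):
# drop repeated elements, keeping the first occurrence of each.
def distinct(items):
    return list(dict.fromkeys(items))

sentence = "i like this book and this book be downloaded on line"

# An array holding the whole sentence as ONE element has nothing to deduplicate.
one_element = distinct([sentence])

# Splitting into words first gives distinct() separate elements to compare.
per_word = distinct(sentence.split(" "))

print(one_element)
print(per_word)
```

The first result is still the full sentence in a single element, which matches the output shown in the question.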

A.B · Accepted Answer · 2020-10-01 23:17:12Z

3

You can use the Spark SQL functions lcase, split, array_distinct and array_join, called through F.expr from pyspark.sql.functions.

For example, F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")

Here is the working code:

import pyspark.sql.functions as F

# df is expected to have a string column named "text"
df.withColumn(
    "text_new",
    F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")
).show(truncate=False)

Explanation:

Here, you first convert everything to lower case with lcase(text), then split the string on spaces with split(text,' '), which produces

[i, like, this, book, and, this, book, be, downloaded, on, line]

Then you pass this to array_distinct, which produces

[i, like, this, book, and, be, downloaded, on, line]

Finally, you join the words back together with a single space using array_join:

i like this book and be downloaded on line
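
The steps above can also be traced outside Spark. The following is a minimal plain-Python sketch, not the answer's Spark code: the helper name dedupe_words is hypothetical, and it assumes that array_distinct keeps the first occurrence of each word, as the output above suggests.

```python
def dedupe_words(text, sep=" "):
    """Lower-case, split, drop repeated words, and re-join (hypothetical helper)."""
    lowered = text.lower()                 # mirrors lcase(text)
    words = lowered.split(sep)             # mirrors split(..., ' ')
    unique = list(dict.fromkeys(words))    # mirrors array_distinct, first occurrence kept
    return sep.join(unique)                # mirrors array_join(..., ' ')

result = dedupe_words("I like this Book and this book be DOWNLOADED on line")
print(result)
```

Each line corresponds to one function in the Spark SQL expression, read from the innermost call outward.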

edited Oct 1, 2020 at 23:17

answered Sep 15, 2020 at 9:17

A.B
