I would like to remove duplicated words from a column of a PySpark DataFrame.

My attempt is based on the question "Remove duplicates from PySpark array column".

My Spark version:

  2.4.5

Python 3 code:

  from pyspark.sql import functions as F

  test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
  # Wrap the string in an array because the original large DataFrame has an array-typed column.
  t3 = test_df.withColumn("text", F.array("text"))

  # Lowercase every array element, then drop duplicate elements.
  t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
  t5 = t4.withColumn('text', F.array_distinct("text"))
  t5.show(1, 120)

but got

 +------------------------------------------------------+
 |                                                  text|
 +------------------------------------------------------+
 |[i like this book and this book be downloaded on line]|
 +------------------------------------------------------+

I need to remove

 book and this

It seems that array_distinct cannot filter them out. Why not?

Thanks.

  • Do have a look into the given link. It might be helpful: stackoverflow.com/questions/47316783/… Commented Sep 15, 2020 at 5:11
  • The word "and" is not duplicated anywhere in the string, so on what basis do you want to remove it? Or do you mean "book and this"? Can you show your desired final result? Commented Sep 15, 2020 at 7:29
  • It won't filter out anything, because the array holds a single string rather than multiple strings, so array_distinct finds only one element in it. I assume you need to remove duplicate words from the string, not duplicate strings from the array. Is this correct? Commented Sep 15, 2020 at 8:58
  • @user3448022, have you tried my answer and did it help? Commented Sep 21, 2020 at 21:20

1 Answer

You can combine the lcase, split, array_distinct, and array_join functions in a Spark SQL expression passed to F.expr (from pyspark.sql.functions).

For example, F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")

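Here is working code, assuming text is a plain string column (not an array):

import pyspark.sql.functions as F

# Lowercase the string, split it into words, drop repeated words, and rejoin.
(
    df.withColumn(
        "text_new",
        F.expr("array_join(array_distinct(split(lcase(text), ' ')), ' ')"),
    )
    .show(truncate=False)
)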

Explanation:

Here, you first convert everything to lower case with lcase(text), then split the string on spaces with split(text, ' '), which produces

[i, like, this, book, and, this, book, be, downloaded, on, line]

then you pass this to array_distinct, which produces

[i, like, this, book, and, be, downloaded, on, line]

and finally, join it with whitespace using array_join

i like this book and be downloaded on line
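
The steps above can be mirrored in plain Python to check the expected result without a Spark session. This is only a sketch of the same lowercase, split, de-duplicate, join sequence, not Spark code: the helper names dedupe_words and dedupe_each are made up for illustration, and dict.fromkeys stands in for array_distinct on the assumption that first occurrences are kept in order, as in the output shown above. For the array-typed column mentioned in the question, the same expression could presumably be applied per element with transform, e.g. transform(text, x -> array_join(array_distinct(split(lcase(x), ' ')), ' ')); that variant has not been run against Spark 2.4.5 here.

```python
def dedupe_words(text: str) -> str:
    """Lowercase, split on single spaces, keep first occurrences, rejoin."""
    words = text.lower().split(" ")        # analogue of lcase + split
    unique = list(dict.fromkeys(words))    # order-preserving stand-in for array_distinct
    return " ".join(unique)                # analogue of array_join


def dedupe_each(texts):
    """Apply dedupe_words to every string in a list (array-column analogue)."""
    return [dedupe_words(t) for t in texts]


sentence = "I like this Book and this book be DOWNLOADED on line"
result = dedupe_words(sentence)
array_result = dedupe_each([sentence])
print(result)
print(array_result)
```

Note that de-duplicating the one-element list itself would change nothing, which is why array_distinct on the array column in the question had no effect: the words have to be split out first.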