
To keep it simple, I have a DataFrame with the following schema:

root
 |-- Event_Time: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)

Some of the elements of "tokens" contain numbers and special characters, for example:

 "431883", "r2b2", "@refe98"

Is there any way I can remove all those and keep only actual words? I want to do LDA later and want to clean my data beforehand. I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly. Thanks

Edit 2:

from pyspark.sql.functions import explode, regexp_replace

df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', '').alias('elements'))
       )

This works only if the column is a string type, and with the explode method I can explode an array into strings, but then they are no longer in the same row... Can anyone improve on this?

  • After the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key: df2 = df_1.select("Event_Time", regexp_replace(explode("tokens"), "<your regex here>", "").alias("elements")).groupBy("Event_Time").agg(collect_list("elements").alias("tokens")) Commented Aug 6, 2018 at 13:38
  • Thanks, that's actually what I'm doing right now; we'll see how it goes. I'm not sure Event_Time is a unique key, but I have other columns. It looks like what you wrote. I'll edit if the solution works. I'm still wondering whether there is a more elegant solution to this. Commented Aug 6, 2018 at 13:46
  • Unfortunately, there is currently no way to iterate over an array in PySpark without using a udf or an rdd. Commented Aug 6, 2018 at 13:47
  • Ah, very good to know, because I would probably have spent more time thinking about it. Thank you very much. Commented Aug 6, 2018 at 13:49

3 Answers

from pyspark.sql.functions import concat_ws, regexp_replace

df = spark.createDataFrame([(["@a", "b", "c"],), ([],)], ['data'])
# Join the array into one comma-separated string...
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
# ...then strip the unwanted characters from that string.
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{@]", ""))
df_1.show()

+----------+------+------+
|      data|data_1|data_2|
+----------+------+------+
|[@a, b, c]|@a,b,c| a,b,c|
|        []|      |      |
+----------+------+------+
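
A possible follow-up (my addition, not part of the answer above): if the array form is needed again, e.g. for LDA, the cleaned string can be split back on the delimiter. Note that splitting an empty string yields [""] rather than an empty array, so that row may need extra handling.

from pyspark.sql.functions import split

# Turn the cleaned comma-separated string back into an array column.
# Caveat: split('', ',') yields [""], not [].
df_1 = df_1.withColumn('data_3', split('data_2', ','))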

4 Comments

  • Yeah, this is what I tried at first, but there is a type mismatch error: str1 is not a string but an array (of strings), and this is why I thought about using explode, with no success either. Thanks
  • Nope, sorry, it still doesn't work. I'm slow to reply because even though I downsample the data, running the commands still takes a lot of time.
  • I have posted with a sample; if your data doesn't look like this, then post a sample using .show() so I can get a better understanding.
  • It sure is helpful, but still not a solution to my problem :( (I upvoted anyway ;) not sure if it was taken into account since I'm new here)

The solution I found (as also stated by pault in the comment section):

After explode on tokens, I groupBy and agg with collect_list to get the tokens back in the format I want them.

Here is pault's comment: After the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:

from pyspark.sql.functions import explode, regexp_replace, collect_list

# explode gets its own select first: generator expressions such as explode
# cannot be nested inside other expressions like regexp_replace.
df2 = (df_1.select("Event_Time", explode("tokens").alias("elements"))
           .select("Event_Time", regexp_replace("elements", "<your regex here>", "").alias("elements"))
           .groupBy("Event_Time")
           .agg(collect_list("elements").alias("tokens")))
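
One detail worth noting (my addition, not part of pault's comment): tokens that consist entirely of digits, such as "431883", become empty strings after the replacement. A minimal sketch for dropping them, assuming Spark 2.4+ for the filter higher-order function and the df2 built above:

from pyspark.sql.functions import expr

# Remove the elements that regexp_replace emptied out completely.
df3 = df2.withColumn("tokens", expr("filter(tokens, x -> x != '')"))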

Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or an rdd.



The transform() function was added to the PySpark API in 3.1.0 (the underlying SQL higher-order function has been usable through expr() since Spark 2.4), which helped me accomplish this task a little more easily. The example in the question would now look like this:

from pyspark.sql import functions as F

# Note: regexp_replace needs an explicit replacement argument, and the pattern
# is doubly escaped because it passes through Spark SQL's string parser.
df_2 = df_1.withColumn("tokens",
                F.expr(r"transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w*', ''))"))
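
Since pyspark.sql.functions.transform accepts a Python lambda from 3.1.0 onward, the same cleanup can be written without expr(). A minimal sketch under that assumption, reusing the column name from the question:

from pyspark.sql import functions as F

# Native-API equivalent (PySpark >= 3.1.0): the lambda receives each array
# element as a Column and returns the cleaned element.
df_2 = df_1.withColumn(
    "tokens",
    F.transform("tokens", lambda x: F.regexp_replace(x, r"\w*\d\w*", "")))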

