
To keep it simple, I have a DataFrame with the following schema:

root
 |-- Event_Time: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)

Some of the elements of "tokens" contain numbers and special characters, for example:

 "431883", "r2b2", "@refe98"

Is there any way I can remove all those and keep only actual words? I want to do LDA later and want to clean my data beforehand. I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly. Thanks

Edit 2:

from pyspark.sql.functions import explode, regexp_replace

df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', '').alias('elements'))
       )

This works only if the column is a string type, and with the explode method I can explode an array into strings, but then they are no longer in the same row... Can anyone improve on this?

  • After the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key: df2 = df_1.select("Event_Time", regexp_replace(explode("tokens"), "<your regex here>", "").alias("elements")).groupBy("Event_Time").agg(collect_list("elements").alias("tokens")) Commented Aug 6, 2018 at 13:38
  • Thanks, that's actually what I'm doing right now; we'll see how it goes. I'm not sure Event_Time is a unique key, but I have other columns. It looks like what you wrote. I'll edit if the solution works. I'm still wondering whether there is a more elegant solution to this. Commented Aug 6, 2018 at 13:46
  • Unfortunately, there is currently no way to iterate over an array in PySpark without using a udf or an rdd. Commented Aug 6, 2018 at 13:47
  • Ah, very good to know, because I would probably have spent more time thinking about it. Thank you very much. Commented Aug 6, 2018 at 13:49

3 Answers

from pyspark.sql.functions import concat_ws, regexp_replace

df = spark.createDataFrame([(["@a", "b", "c"],), ([],)], ['data'])
# Join the array into one comma-separated string...
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
# ...then strip the unwanted characters from that string.
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{@]", ""))
df_1.show()

+----------+------+------+
|      data|data_1|data_2|
+----------+------+------+
|[@a, b, c]|@a,b,c| a,b,c|
|        []|      |      |
+----------+------+------+
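
A possible follow-up (my addition, not part of the answer above): if the array form is needed again, e.g. for LDA, the cleaned string can be split back on the delimiter. Note that splitting an empty string yields [""] rather than an empty array, so that row may need extra handling.

from pyspark.sql.functions import split

# Turn the cleaned comma-separated string back into an array column.
# Caveat: split('', ',') yields [""], not [].
df_1 = df_1.withColumn('data_3', split('data_2', ','))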

4 Comments

  • Yeah, this is what I tried at first, but there is a type mismatch error: str1 is not a string but an array (of strings), and this is why I thought about using explode, with no success either. Thanks
  • Nope, sorry, it still doesn't work. I'm slow to reply because even though I downsample the data, running the commands still takes a lot of time.
  • I have posted with a sample; if your data doesn't look like this, then post a sample using .show() so I can get a better understanding.
  • It sure is helpful, but still not a solution to my problem :( (I upvoted anyway ;) not sure if it was taken into account since I'm new here)

The solution I found (as also stated by pault in the comment section):

After explode on tokens, I groupBy and agg with collect_list to get the tokens back in the format I want them.

Here is pault's comment: After the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:

from pyspark.sql.functions import explode, regexp_replace, collect_list

# explode gets its own select first: generator expressions such as explode
# cannot be nested inside other expressions like regexp_replace.
df2 = (df_1.select("Event_Time", explode("tokens").alias("elements"))
           .select("Event_Time", regexp_replace("elements", "<your regex here>", "").alias("elements"))
           .groupBy("Event_Time")
           .agg(collect_list("elements").alias("tokens")))
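
One detail worth noting (my addition, not part of pault's comment): tokens that consist entirely of digits, such as "431883", become empty strings after the replacement. A minimal sketch for dropping them, assuming Spark 2.4+ for the filter higher-order function and the df2 built above:

from pyspark.sql.functions import expr

# Remove the elements that regexp_replace emptied out completely.
df3 = df2.withColumn("tokens", expr("filter(tokens, x -> x != '')"))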

Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or an rdd.



The transform() function was added to the PySpark API in 3.1.0 (the underlying SQL higher-order function has been usable through expr() since Spark 2.4), which helped me accomplish this task a little more easily. The example in the question would now look like this:

from pyspark.sql import functions as F

# Note: regexp_replace needs an explicit replacement argument, and the pattern
# is doubly escaped because it passes through Spark SQL's string parser.
df_2 = df_1.withColumn("tokens",
                F.expr(r"transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w*', ''))"))
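
Since pyspark.sql.functions.transform accepts a Python lambda from 3.1.0 onward, the same cleanup can be written without expr(). A minimal sketch under that assumption, reusing the column name from the question:

from pyspark.sql import functions as F

# Native-API equivalent (PySpark >= 3.1.0): the lambda receives each array
# element as a Column and returns the cleaned element.
df_2 = df_1.withColumn(
    "tokens",
    F.transform("tokens", lambda x: F.regexp_replace(x, r"\w*\d\w*", "")))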

