
hey, I have a dataframe that contains rows with these columns: date and text, and I need to find how many rows contain the word "corona" per day (two ways: dataframes and SQL).

  • the word corona needs to be a whole word and not a substring, and if the word has a punctuation mark next to it I need to count that as well.

I started with removing the punctuation from the text column, then I added an indicator column called check to mark if a row has the word corona in it; after that I summed the check column and grouped by the date column.

1. I wanted to ask: is this the right way to do such a thing?

2. I tried to translate this to a PySpark SQL query (I need to add the check column with SQL code if I am using this way), but the results were very different, so how can I translate this?

dataframes way:

# above I defined the punctuation-removal function and read the data into df
df = df.withColumn('no_punc_text', punc_udf('text'))
df = df.select('no_punc_text', 'dates')
df.registerTempTable('my_table')
df = df.withColumn("check", F.col("no_punc_text").rlike("corona " or " corona" or " corona ").cast("Integer"))
dfway = df.groupBy("dates").sum('check')
the sql way:

sqlw = spark.sql(
      """
        select dates, sum(
         case when (no_punc_text rlike 'corona ') then 1
         when (no_punc_text rlike ' corona') then 1
         when (no_punc_text rlike ' corona ') then 1 else 0 end
        ) as check
        from my_table group by dates
      """)
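One likely source of the difference: in the dataframes version, the expression "corona " or " corona" or " corona " is evaluated by Python before Spark ever sees it, and or between non-empty strings simply returns its first operand, so rlike only receives the single pattern "corona ". The SQL version really does test all three patterns. A minimal sketch of the gotcha:

# Python's `or` between non-empty strings returns the first operand,
# so only the single pattern "corona " ever reaches rlike:
pattern = "corona " or " corona" or " corona "
print(repr(pattern))   # 'corona '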

2 Answers


Use a word boundary (\b), as below.

Load the test data:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val df = Seq("corona", "corona?", "this is corona", "coronavirus", "corona's", "is this corona?")
      .toDF("text")
      .withColumn("dates", monotonically_increasing_id())
    df.show(false)
    df.printSchema()

    /**
      * +---------------+-----+
      * |text           |dates|
      * +---------------+-----+
      * |corona         |0    |
      * |corona?        |1    |
      * |this is corona |2    |
      * |coronavirus    |3    |
      * |corona's       |4    |
      * |is this corona?|5    |
      * +---------------+-----+
      *
      * root
      * |-- text: string (nullable = true)
      * |-- dates: long (nullable = false)
      */

Detect the corona word as per the requirement below:

the word corona needs to be a whole word and not a substring, and if the word has a punctuation mark next to it it should be counted as well.

    df.createOrReplaceTempView("my_table")
    spark.sql(
      """
        | select dates, sum(
        |         case when (text rlike '\\bcorona\\b') then 1
        |         else 0 end
        |        ) as check
        |        from my_table group by dates
      """.stripMargin)
      .show(false)

    /**
      * +-----+-----+
      * |dates|check|
      * +-----+-----+
      * |2    |1    |
      * |4    |1    |
      * |5    |1    |
      * |0    |1    |
      * |1    |1    |
      * |3    |0    |
      * +-----+-----+
      */

Please note that the coronavirus string is not detected as corona, since you don't want to match substrings.
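The same boundary semantics can be sanity-checked with Python's own regex engine, which treats \b the same way for this plain-ASCII case:

import re

word = re.compile(r"\bcorona\b")
print(bool(word.search("coronavirus")))      # False - only a substring match
print(bool(word.search("is this corona?")))  # True - adjacent punctuation still matches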

In Python:

sqlw = spark.sql(
      """
         select dates, sum(
          case when (text rlike '\\bcorona\\b') then 1
          else 0 end
         ) as check
         from my_table group by dates
      """)
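For completeness, since the question asked for both ways, here is a hedged sketch of an equivalent DataFrame-API version (assuming a PySpark df with the same text and dates columns, and F as pyspark.sql.functions):

import pyspark.sql.functions as F

# rlike returns a boolean; cast to int so sum() counts the matches.
# Unlike the SQL path, the DataFrame API does not re-unescape the
# pattern string, so a plain raw string is enough here.
dfway = (df
         .withColumn("check", F.col("text").rlike(r"\bcorona\b").cast("int"))
         .groupBy("dates")
         .sum("check"))
dfway.show()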

3 Comments

Please note that "Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser... There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing." (spark.apache.org/docs/latest/api/sql/index.html#rlike) So with this setting set to false (default) your search pattern should actually be text rlike '\\\\bcorona\\\\b'. Upvoted nonetheless since overall this is the correct solution.
I have executed the above solution without setting this property, in Spark 2.4.
@Raghu, do you mean did I execute the solution I provided? Yes, the one given in Scala.
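To make the escaping point from the first comment concrete, a small sketch (assuming a live spark session with the default parser settings):

# With the default spark.sql.parser.escapedStringLiterals=false, the
# SQL parser unescapes string literals, so the four backslashes in
# this Python source survive as the two the regex engine needs for \b:
spark.sql("select 'corona?' rlike '\\\\bcorona\\\\b' as matched").show()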

I can help with the PySpark part. It is better to avoid using a udf; there is almost always an equivalent way of doing it with a built-in function. In your case the contains() function of Column will be helpful. Refer: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=contain#pyspark.sql.Column.contains

Consider a test dataframe.

test_df = sqlContext.createDataFrame([
    "stay safe",
    "lets make the world coronafree",
    "corona spreads through contact",
    "there is no vaccine yet for corona,but is in progress",
    "community has to unite against corona."
], "string").toDF('text')
test_df.show(truncate=False)

+-----------------------------------------------------+
|text                                                 |
+-----------------------------------------------------+
|stay safe                                            |
|lets make the world coronafree                       |
|corona spreads through contact                       |
|there is no vaccine yet for corona,but is in progress|
|community has to unite against corona.               |
+-----------------------------------------------------+

import pyspark.sql.functions as F

test_df_f = test_df.where(F.col('text').contains('corona'))
test_df_f.show(truncate=False)
+-----------------------------------------------------+
|text                                                 |
+-----------------------------------------------------+
|lets make the world coronafree                       |
|corona spreads through contact                       |
|there is no vaccine yet for corona,but is in progress|
|community has to unite against corona.               |
+-----------------------------------------------------+

You can see that the punctuation is already taken care of. With this filtered dataframe, test_df_f, you can perform a count to directly get the number of rows, or any other date-wise aggregation for further analysis, as sketched below.
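A minimal sketch of that counting step (the test dataframe here has no dates column, so the grouped variant is shown against a hypothetical one):

# total number of matching rows
print(test_df_f.count())

# per-day counts, assuming the real data carries a `dates` column:
# test_df_f.groupBy('dates').count().show()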

If you need to match the whole word then you can use this:

test_df_f_whole = test_df.where("text RLIKE '\\\\bcorona\\\\b'")
test_df_f_whole.show(truncate=False)

+-----------------------------------------------------+
|text                                                 |
+-----------------------------------------------------+
|corona spreads through contact                       |
|there is no vaccine yet for corona,but is in progress|
|community has to unite against corona.               |
+-----------------------------------------------------+

Ref : How to use word boundary in RLIKE in PySpark SQL/Dataframes
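Putting the pieces together for the original per-day requirement, a sketch against the question's dataframe (its dates column is assumed; the test dataframe above only has text):

import pyspark.sql.functions as F

# whole-word filter first, then one row count per day
result = (df
          .where(F.col('text').rlike(r'\bcorona\b'))
          .groupBy('dates')
          .count())
result.show()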

3 Comments

I don't think this is going to work, because contains() counts substrings as a match.
@mazaneicha - thanks for the hint. I have edited the answer; check it out.
@shreder1921 - check out the edited answer and let me know.
