
hey, I have a dataframe that contains rows with these columns: date and text, and I need to find how many rows contain the word "corona" per day (two ways: dataframes and SQL).

  • the word corona needs to be a whole word and not a substring, and if the word has a punctuation mark next to it I need to count that as well.

I started with removing the punctuation from the text column, then I added an indicator column called check to mark if a row has the word corona in it; after that I summed the check column and grouped by the date column.

1. I wanted to ask: is this the right way to do such a thing?

2. I tried to translate this to a PySpark SQL query (I need to add the check column with SQL code if I am using this way), but the results were very different, so how can I translate this?

dataframes way:

# above I defined the punctuation-removal function and read the data into df
df = df.withColumn('no_punc_text', punc_udf('text'))
df = df.select('no_punc_text', 'dates')
df.registerTempTable('my_table')
df = df.withColumn("check", F.col("no_punc_text").rlike("corona " or " corona" or " corona ").cast("Integer"))
dfway = df.groupBy("dates").sum('check')
the sql way:

sqlw = spark.sql(
      """
        select dates, sum(
         case when (no_punc_text rlike 'corona ') then 1
         when (no_punc_text rlike ' corona') then 1
         when (no_punc_text rlike ' corona ') then 1 else 0 end
        ) as check
        from my_table group by dates
      """)
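One likely source of the difference: in the dataframes version, the expression "corona " or " corona" or " corona " is evaluated by Python before Spark ever sees it, and or between non-empty strings simply returns its first operand, so rlike only receives the single pattern "corona ". The SQL version really does test all three patterns. A minimal sketch of the gotcha:

# Python's `or` between non-empty strings returns the first operand,
# so only the single pattern "corona " ever reaches rlike:
pattern = "corona " or " corona" or " corona "
print(repr(pattern))   # 'corona '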

2 Answers


Use a word boundary (\b), as below.

Load the test data:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val df = Seq("corona", "corona?", "this is corona", "coronavirus", "corona's", "is this corona?")
      .toDF("text")
      .withColumn("dates", monotonically_increasing_id())
    df.show(false)
    df.printSchema()

    /**
      * +---------------+-----+
      * |text           |dates|
      * +---------------+-----+
      * |corona         |0    |
      * |corona?        |1    |
      * |this is corona |2    |
      * |coronavirus    |3    |
      * |corona's       |4    |
      * |is this corona?|5    |
      * +---------------+-----+
      *
      * root
      * |-- text: string (nullable = true)
      * |-- dates: long (nullable = false)
      */

Detect the corona word as per the requirement below:

the word corona needs to be a whole word and not a substring, and if the word has a punctuation mark next to it it should be counted as well.

    df.createOrReplaceTempView("my_table")
    spark.sql(
      """
        | select dates, sum(
        |         case when (text rlike '\\bcorona\\b') then 1
        |         else 0 end
        |        ) as check
        |        from my_table group by dates
      """.stripMargin)
      .show(false)

    /**
      * +-----+-----+
      * |dates|check|
      * +-----+-----+
      * |2    |1    |
      * |4    |1    |
      * |5    |1    |
      * |0    |1    |
      * |1    |1    |
      * |3    |0    |
      * +-----+-----+
      */

Please note that the coronavirus string is not detected as corona, since you don't want to match substrings.
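The same boundary semantics can be sanity-checked with Python's own regex engine, which treats \b the same way for this plain-ASCII case:

import re

word = re.compile(r"\bcorona\b")
print(bool(word.search("coronavirus")))      # False - only a substring match
print(bool(word.search("is this corona?")))  # True - adjacent punctuation still matches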

In Python:

sqlw = spark.sql(
      """
         select dates, sum(
          case when (text rlike '\\bcorona\\b') then 1
          else 0 end
         ) as check
         from my_table group by dates
      """)
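For completeness, since the question asked for both ways, here is a hedged sketch of an equivalent DataFrame-API version (assuming a PySpark df with the same text and dates columns, and F as pyspark.sql.functions):

import pyspark.sql.functions as F

# rlike returns a boolean; cast to int so sum() counts the matches.
# Unlike the SQL path, the DataFrame API does not re-unescape the
# pattern string, so a plain raw string is enough here.
dfway = (df
         .withColumn("check", F.col("text").rlike(r"\bcorona\b").cast("int"))
         .groupBy("dates")
         .sum("check"))
dfway.show()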

3 Comments

Please note that "Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser... There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fallback to the Spark 1.6 behavior regarding string literal parsing." (spark.apache.org/docs/latest/api/sql/index.html#rlike) So with this setting set to false (default) your search pattern should actually be text rlike '\\\\bcorona\\\\b'. Upvoted nonetheless since overall this is the correct solution.
I have executed the above solution without setting this property, in Spark 2.4.
@Raghu, do you mean did I execute the solution I provided? Yes, the one given in Scala.
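To make the escaping point from the first comment concrete, a small sketch (assuming a live spark session with the default parser settings):

# With the default spark.sql.parser.escapedStringLiterals=false, the
# SQL parser unescapes string literals, so the four backslashes in
# this Python source survive as the two the regex engine needs for \b:
spark.sql("select 'corona?' rlike '\\\\bcorona\\\\b' as matched").show()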

I can help with the PySpark part. It is better to avoid using a udf; there is almost always an equivalent way of doing it with a built-in function. In your case the contains() function of Column will be helpful. Refer: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=contain#pyspark.sql.Column.contains

Consider a test dataframe.

test_df = sqlContext.createDataFrame([
    "stay safe",
    "lets make the world coronafree",
    "corona spreads through contact",
    "there is no vaccine yet for corona,but is in progress",
    "community has to unite against corona."
], "string").toDF('text')
test_df.show(truncate=False)

+-----------------------------------------------------+
|text                                                 |
+-----------------------------------------------------+
|stay safe                                            |
|lets make the world coronafree                       |
|corona spreads through contact                       |
|there is no vaccine yet for corona,but is in progress|
|community has to unite against corona.               |
+-----------------------------------------------------+

import pyspark.sql.functions as F

test_df_f = test_df.where(F.col('text').contains('corona'))
test_df_f.show(truncate=False)
+-----------------------------------------------------+
|text                                                 |
+-----------------------------------------------------+
|lets make the world coronafree                       |
|corona spreads through contact                       |
|there is no vaccine yet for corona,but is in progress|
|community has to unite against corona.               |
+-----------------------------------------------------+

You can see that the punctuation is already taken care of. With this filtered dataframe, test_df_f, you can perform a count to directly get the number of rows, or any other date-wise aggregation for further analysis, as sketched below.
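A minimal sketch of that counting step (the test dataframe here has no dates column, so the grouped variant is shown against a hypothetical one):

# total number of matching rows
print(test_df_f.count())

# per-day counts, assuming the real data carries a `dates` column:
# test_df_f.groupBy('dates').count().show()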

If you need to match the whole word then you can use this:

test_df_f_whole = test_df.where("text RLIKE '\\\\bcorona\\\\b'")
test_df_f_whole.show(truncate=False)

+-----------------------------------------------------+
|text                                                 |
+-----------------------------------------------------+
|corona spreads through contact                       |
|there is no vaccine yet for corona,but is in progress|
|community has to unite against corona.               |
+-----------------------------------------------------+

Ref : How to use word boundary in RLIKE in PySpark SQL/Dataframes
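Putting the pieces together for the original per-day requirement, a sketch against the question's dataframe (its dates column is assumed; the test dataframe above only has text):

import pyspark.sql.functions as F

# whole-word filter first, then one row count per day
result = (df
          .where(F.col('text').rlike(r'\bcorona\b'))
          .groupBy('dates')
          .count())
result.show()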

3 Comments

I don't think this is going to work, because contains() counts substrings as a match.
@mazaneicha - thanks for the hint. I have edited the answer; check it out.
@shreder1921 - check out the edited answer and let me know.
