Hey, I have a dataframe with two columns, date and text, and I need to find how many rows contain the word "corona" per day, in two ways: with DataFrames and with SQL.
- the word "corona" needs to be a whole word, not a substring, and if the word has a punctuation mark next to it I need to count it as well.
I started by removing the punctuation from the text column, then I added an indicator column called check to mark whether a row contains the word "corona", and after that I summed the check column grouped by the date column (a sketch of the punctuation step is below, right after my two questions).
1. Is this the right way to do such a thing?
2. I tried to translate this to a PySpark SQL query (I need to build the check column in SQL code if I go this way), but the results were very different, so how can I translate it correctly?
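To make the question self-contained, here is a minimal sketch of the kind of punctuation-removal UDF I mean (illustrative only; remove_punctuation and punc_udf are just the names I use below, and I replace each punctuation character with a space so "corona," still comes out as the standalone word "corona"):

import string
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def remove_punctuation(s):
    # swap every punctuation character for a space, so "corona," or
    # "(corona)" still yields "corona" as a separate word
    if s is None:
        return None
    return s.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))

punc_udf = F.udf(remove_punctuation, StringType())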
The DataFrame way:
# above I defined the punctuation-removal function and read the data into df
from pyspark.sql import functions as F

df = df.withColumn('no_punc_text', punc_udf('text'))
df = df.select('no_punc_text', 'dates')
df.registerTempTable('my_table')
df = df.withColumn("check", F.col("no_punc_text").rlike("corona " or " corona" or " corona ").cast("Integer"))
dfway = df.groupBy("dates").sum('check')
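While writing this up I noticed that the rlike line is probably part of the problem: "corona " or " corona" or " corona " is evaluated by Python's or operator before rlike ever runs, and or just returns the first non-empty string, so rlike only ever tests the single pattern "corona ". If that is the bug, I am guessing the cleaner fix is one regex with word boundaries; a sketch of what I mean (assuming my column names, and that rlike uses Java regex so \b is supported):

from pyspark.sql import functions as F

# \bcorona\b matches "corona" only as a whole word; \b also fires next to
# punctuation, so this should work even on the raw 'text' column
df = df.withColumn("check", F.col("no_punc_text").rlike(r"\bcorona\b").cast("integer"))
dfway = df.groupBy("dates").agg(F.sum("check").alias("check"))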
The SQL way:
sqlw = spark.sql(
"""
select dates, sum(
    case when (no_punc_text rlike 'corona ') then 1
         when (no_punc_text rlike ' corona') then 1
         when (no_punc_text rlike ' corona ') then 1
         else 0 end
) as check
from my_table group by dates
""")