I'm currently trying to solve a problem where I have a large string of text (summary) and I'm searching for certain words within that summary. If any of the words belonging to a given category appears in the summary, I want to add that category's tag to an array, as outlined below:
ground = ['car', 'motorbike']
air = ['plane']
colour = ['blue', 'red']
| Summary | Tag_Array |
|------------------------|----------------------|
| This is a blue car | ['ground', 'colour'] |
| This is red motorbike | ['ground', 'colour'] |
| This is a plane | ['air'] |
The idea is that it reads each summary and then creates an array in the Tag_Array column containing the tags associated with the summary text. The tag for ground can be triggered by any number of words; in this case, both motorbike and car return the tag ground.
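To pin down the expected behaviour, here is a minimal plain-Python sketch of the rule (not PySpark yet). The `categories` dict and the `tags_for` helper are hypothetical names introduced only for illustration, and the sketch assumes words are separated by single spaces and matched exactly, with no handling of case or punctuation.

```python
# Category -> trigger words, copied from the lists above.
categories = {
    "ground": ["car", "motorbike"],
    "air": ["plane"],
    "colour": ["blue", "red"],
}


def tags_for(summary, categories):
    """Return the tags whose trigger words appear in the summary."""
    words = set(summary.split(" "))
    # Keep a tag if at least one of its trigger words is among the summary's words.
    return [tag for tag, triggers in categories.items() if words & set(triggers)]


for text in ["This is a blue car", "This is red motorbike", "This is a plane"]:
    print(text, "->", tags_for(text, categories))
```

Each tag appears at most once per summary, in the order the categories are declared.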
I have this working functionally, but the approach is really awful and very verbose, so my intention here is to work out the most appropriate way to achieve this in PySpark.
from pyspark.sql import functions as f

df = (df
    .withColumn("summary_as_array", f.split('summary', " "))
    .withColumn("tag_array", f.array(
        f.when(f.array_contains('summary_as_array', "car"), "ground").otherwise(""),
        f.when(f.array_contains('summary_as_array', "motorbike"), "ground").otherwise("")
        )
    )
)