
I'm trying to solve a problem where I have a large string of text (summary) and I'm searching for certain words within that summary. Depending on which words from each category appear in the summary, I want to create an array of the respective tags, as outlined below:

ground = ['car', 'motorbike']
air = ['plane']
colour = ['blue', 'red']

| Summary                | Tag_Array            |
|------------------------|----------------------|
| This is a blue car     | ['ground', 'colour'] |
| This is red motorbike  | ['ground', 'colour'] |
| This is a plane        | ['air']              |

The idea is that it reads each summary and then creates an array in the Tag_Array column containing the tags associated with the summary text. A tag can be triggered by any number of words; in this case both motorbike and car return the tag ground.
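The mapping described above can be sketched in plain Python, outside Spark. This is only an illustration: the names `categories`, `word_to_tag`, and `tags_for` are made up for the example, and lowercasing the summary is an assumption that the question does not state.

```python
# Hypothetical names; the keyword lists are the ones given in the question.
categories = {
    "ground": ["car", "motorbike"],
    "air": ["plane"],
    "colour": ["blue", "red"],
}

# Invert to keyword -> tag so each word needs a single dictionary lookup.
word_to_tag = {word: tag for tag, words in categories.items() for word in words}


def tags_for(summary):
    """Return the tags whose keywords occur as whole words, in order of first appearance."""
    tags = []
    # Assumption: matching is case-insensitive, hence the lower().
    for word in summary.lower().split():
        tag = word_to_tag.get(word)
        if tag is not None and tag not in tags:
            tags.append(tag)
    return tags


print(tags_for("This is a blue car"))  # ['colour', 'ground']
```

Because both car and motorbike point at ground, either word yields the same tag, and the `tag not in tags` check keeps a tag from being listed twice.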

I have this functionally working, but with a really awful and very verbose approach, so my intention here is to work out the most appropriate way to achieve this in PySpark.

    df = (df
        .withColumn("summary_as_array", f.split('summary', " "))
        .withColumn("tag_array", f.array(
            f.when(f.array_contains('summary_as_array', "car"), "ground").otherwise(""),
            f.when(f.array_contains('summary_as_array', "motorbike"), "ground").otherwise("")
            )
        )
    )

  • Thinking about this some more, I imagine a UDF is the most efficient way to approach this, but I'm not really sure of the form it would take. Commented Mar 1, 2019 at 10:50

1 Answer


If you can convert the tags into key-value pairs like this:

tagDict = {'ground':['car', 'motorbike'],'air':['plane'],'colour':['blue','red']}

then we can create a UDF that checks each summary against the dictionary's values and returns the matching keys, which are the tags. A simple solution:

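from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

l = [('This is a blue car',), ('This is red motorbike',), ('This is a plane',)]
df = spark.createDataFrame(l, ['summary'])

# Declare the return type; without it the UDF defaults to StringType and tag_array is a string, not an array.
tag_udf = F.udf(lambda x: [k for k, v in tagDict.items() if any(itm in x for itm in v)],
                ArrayType(StringType()))
df = df.withColumn('tag_array', tag_udf(df['summary']))
df.show()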
+---------------------+----------------+
|summary              |tag_array       |
+---------------------+----------------+
|This is a blue car   |[colour, ground]|
|This is red motorbike|[colour, ground]|
|This is a plane      |[air]           |
+---------------------+----------------+
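One thing to watch: `itm in x` in the lambda is a substring test on the whole summary, so a keyword such as car also matches inside a longer word. The sketch below shows the difference in plain Python. The function names are made up for the example, and splitting on whitespace is an assumption about what counts as a word.

```python
tagDict = {'ground': ['car', 'motorbike'], 'air': ['plane'], 'colour': ['blue', 'red']}


def substring_tags(summary):
    # Same test as the lambda: `itm in summary` is a substring check.
    return [k for k, v in tagDict.items() if any(itm in summary for itm in v)]


def whole_word_tags(summary):
    # Split on whitespace first so only complete words can match.
    words = set(summary.split())
    return [k for k, v in tagDict.items() if words.intersection(v)]


print(substring_tags("This is a scary plane"))   # ['ground', 'air'] ('car' is inside 'scary')
print(whole_word_tags("This is a scary plane"))  # ['air']
```

If whole-word matching is what you need, `whole_word_tags` can be passed to `F.udf` in place of the lambda, with `ArrayType(StringType())` as the return type.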

Hope this helps.


1 Comment

Thanks, this worked really well! I appreciate the concise answer and the simple implementation!
