Reading a string and creating an array of mentioned sub-strings

Question

I'm currently trying to solve a problem where I have a large string of text (summary) and I'm searching for certain words within it. If any one of a number of words belonging to a category exists in the summary, I want to add that category's tag to an array, as outlined below:

    ground = ['car', 'motorbike']
    air = ['plane']
    colour = ['blue', 'red']

| Summary                | Tag_Array            |
|------------------------|----------------------|
| This is a blue car     | ['ground', 'colour'] |
| This is red motorbike  | ['ground', 'colour'] |
| This is a plane        | ['air']              |

The idea is to read each summary and create an array in the Tag_Array column containing the tags associated with the summary text. A tag can be triggered by any number of words; in this case both motorbike and car return the tag ground.

I have this working functionally, but the approach is really awful and very verbose, so my intention here is to work out the most appropriate way to achieve this in PySpark.

    from pyspark.sql import functions as f

    df = (df
        # Split the summary into words, then emit one array element per keyword check
        .withColumn("summary_as_array", f.split('summary', " "))
        .withColumn("tag_array", f.array(
            f.when(f.array_contains('summary_as_array', "car"), "ground").otherwise(""),
            f.when(f.array_contains('summary_as_array', "motorbike"), "ground").otherwise("")
            )
        )
    )

Thinking about this some more, I imagine a UDF is the most efficient way to approach this, but I'm not really sure of the form it would take.
– ImNewToThis, Commented Mar 1, 2019 at 10:50

Suresh · Accepted Answer · 2019-03-02 05:41:06Z


If you can convert the tags into key-value pairs like this:

    tagDict = {'ground':['car', 'motorbike'],'air':['plane'],'colour':['blue','red']}

then we can create a UDF that checks whether any of a tag's values appears in the summary and returns the matching keys, which are the tags. A simple solution:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    l = [('This is a blue car',),('This is red motorbike',),('This is a plane',)]
    df = spark.createDataFrame(l,['summary'])

    # Declare the return type; udf() defaults to StringType, which would give a string column
    tag_udf = F.udf(lambda x : [k for k,v in tagDict.items() if any(itm in x for itm in v)],
                    ArrayType(StringType()))
    df = df.withColumn('tag_array',tag_udf(df['summary']))
    df.show()
    +---------------------+----------------+
    |summary              |tag_array       |
    +---------------------+----------------+
    |This is a blue car   |[colour, ground]|
    |This is red motorbike|[colour, ground]|
    |This is a plane      |[air]           |
    +---------------------+----------------+
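One caveat: `itm in x` is a substring test on the whole summary, so 'car' would also match 'scary'. If whole-word matching is what you need, here is a minimal sketch of the tagging logic in plain Python. The names `tag_dict`, `tag_patterns` and `tags_for` are made up for illustration, and the case-insensitive matching and the handling of `None` are assumptions that are not part of the question.

```python
import re

# Same mapping as tagDict above (renamed here for illustration only).
tag_dict = {
    'ground': ['car', 'motorbike'],
    'air': ['plane'],
    'colour': ['blue', 'red'],
}

# One compiled pattern per tag; \b restricts matches to whole words.
# re.IGNORECASE is an assumption: drop it if case should matter.
tag_patterns = {
    tag: re.compile(
        r'\b(?:' + '|'.join(re.escape(word) for word in words) + r')\b',
        re.IGNORECASE,
    )
    for tag, words in tag_dict.items()
}


def tags_for(summary):
    """Return the tags whose keywords occur as whole words in summary."""
    if summary is None:  # assumption: a null cell should yield no tags
        return []
    return [tag for tag, pattern in tag_patterns.items() if pattern.search(summary)]


print(tags_for('This is a blue car'))  # ['ground', 'colour'] on Python 3.7+
print(tags_for('A scary story'))       # [] because 'car' is not a whole word here
```

A function like this can be passed to `F.udf` together with `ArrayType(StringType())` in place of the lambda. The tags come back in the dictionary's insertion order on Python 3.7+, so the order may differ from the output shown above.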

Hope this helps.


1 Comment

ImNewToThis Over a year ago

Thanks this worked really well! Appreciate the concise nature of the answer and the simple implementation!
