4

Hi I have dataframe with 2 columns :

+----------------------------------------+----------+
|                  Text                  | Key_word |
+----------------------------------------+----------+
| First random text tree cheese cat      | tree     |
| Second random text apple pie three     | text     |
| Third random text burger food brain    | brain    |
| Fourth random text nothing thing chips | random   |
+----------------------------------------+----------+

I want to generate a 3rd columns with a word appearing before the key_word from the text.

+----------------------------------------+----------+-------------------+--+
|                  Text                  | Key_word | word_bef_key_word |  |
+----------------------------------------+----------+-------------------+--+
| First random text tree cheese cat      | tree     | text              |  |
| Second random text apple pie three     | text     | random            |  |
| Third random text burger food brain    | brain    | food              |  |
| Fourth random text nothing thing chips | random   | Fourth            |  |
+----------------------------------------+----------+-------------------+--+

I tried this but it's not working

df2=df1.withColumn('word_bef_key_word',regexp_extract(df1.Text,('\\w+)'df1.key_word,1))

Here is the code to create a example of the dataframe

df = sqlCtx.createDataFrame(
    [
        ('First random text tree cheese cat' , 'tree'),
        ('Second random text apple pie three', 'text'),
        ('Third random text burger food brain' , 'brain'),
        ('Fourth random text nothing thing chips', 'random')
    ],
    ('Text', 'Key_word') 
)
0

1 Answer 1

10

Update

You can also do this without a udf by using pyspark.sql.functions.expr to pass column values as a parameter to pyspark.sql.functions.regexp_extract:

from pyspark.sql.functions import expr

df = df.withColumn(
    'word_bef_key_word', 
    expr(r"regexp_extract(Text, concat('\\w+(?= ', Key_word, ')'), 0)")
)
df.show(truncate=False)
#+--------------------------------------+--------+-----------------+
#|Text                                  |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat     |tree    |text             |
#|Second random text apple pie three    |text    |random           |
#|Third random text burger food brain   |brain   |food             |
#|Fourth random text nothing thing chips|random  |Fourth           |
#+--------------------------------------+--------+-----------------+

Original Answer

One way to do this is by using a udf to do the regex:

import re
from pyspark.sql.functions import udf

def get_previous_word(text, key_word):
    matches = re.findall(r'\w+(?= {kw})'.format(kw=key_word), text)
    return matches[0] if matches else None

get_previous_word_udf = udf(
    lambda text, key_word: get_previous_word(text, key_word),
    StringType()
)

df = df.withColumn('word_bef_key_word', get_previous_word_udf('Text', 'Key_word'))
df.show(truncate=False)
#+--------------------------------------+--------+-----------------+
#|Text                                  |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat     |tree    |text             |
#|Second random text apple pie three    |text    |random           |
#|Third random text burger food brain   |brain   |food             |
#|Fourth random text nothing thing chips|random  |Fourth           |
#+--------------------------------------+--------+-----------------+

The regex pattern '\w+(?= {kw})'.format(kw=key_word) means match a word followed by a space and the key_word. If there are multiple matches, we will return the first one. If there are no matches, the function returns None.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.