
I am trying to use spaCy to tokenize a text and want to turn the tokens into an array column. Currently using:

from pyspark.sql.functions import udf
import spacy
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc]
tokenize = udf(spacy_tokenizer)

df2 = df.withColumn('TOKEN', tokenize('SENTENCE'))

from pyspark.sql.functions import array
df3 = df2.withColumn("TOKEN_ARRAY", array('TOKEN'))
df3.show()
+---------------+---------------------+-----------------------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           |
+---------------+---------------------+-----------------------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] |
+---------------+---------------------+-----------------------+

It is making an array with one element that is the full string, whereas I want an array with 4 elements (each individual token as an element). I tested this with array_contains, which only returns true when I search for the entire string and returns false when I search for an individual token.

from pyspark.sql.functions import array_contains
df3.withColumn("test", array_contains("TOKEN_ARRAY", "[Cool, to, wear, .]")).show()
+---------------+---------------------+-----------------------+-------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           | test  |
+---------------+---------------------+-----------------------+-------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] | true  |
+---------------+---------------------+-----------------------+-------+


df3.withColumn("test", array_contains("TOKEN_ARRAY", "Cool")).show()
+---------------+---------------------+-----------------------+-------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           | test  |
+---------------+---------------------+-----------------------+-------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] | false |
+---------------+---------------------+-----------------------+-------+
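For reference, the single-element behaviour comes from the UDF's default return type: without an explicit returnType, udf assumes StringType(), so the list returned by spacy_tokenizer is collapsed into one string before array('TOKEN') ever sees it. A rough plain-Python illustration of that collapse (the exact string Spark produces differs slightly in formatting):

```python
# The tokenizer returns a Python list...
tokens = ["Cool", "to", "wear", "."]

# ...but a udf declared (implicitly) as StringType must yield one string,
# so the whole list collapses into a single value, roughly like this:
as_string = str(tokens)
print(as_string)  # ['Cool', 'to', 'wear', '.']

# Wrapping that one string column with array() then gives a one-element array:
token_array = [as_string]
print(len(token_array))  # 1, not 4
```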

1 Answer


Needed to specify the array return type this way and then it worked:

from pyspark.sql.types import ArrayType, StringType

tokenize = udf(spacy_tokenizer, ArrayType(StringType()))

With the UDF returning array&lt;string&gt; directly, the extra array('TOKEN') wrapper is no longer needed.
