I am trying to use spaCy to tokenize a text and want to turn the resulting tokens into an array (one element per token).
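For context, assume df is a simple one-column DataFrame (this setup is my own minimal example, not the real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Cool to wear.",)], ["SENTENCE"])

This is what I am currently doing: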
from pyspark.sql.functions import udf
import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc]

tokenize = udf(spacy_tokenizer)
df2 = df.withColumn('TOKEN', tokenize('SENTENCE'))
from pyspark.sql.functions import array
df3 = df2.withColumn("TOKEN_ARRAY", array('TOKEN'))
df3.show()
+---------------+---------------------+-----------------------+
| SENTENCE | TOKEN | TOKEN_ARRAY |
+---------------+---------------------+-----------------------+
| Cool to wear.| [Cool, to, wear, .]| [[Cool, to, wear, .]] |
+---------------+---------------------+-----------------------+
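To see what Spark thinks these columns are, the schema can be printed:

# check whether TOKEN / TOKEN_ARRAY are string or array columns
df3.printSchema()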
This produces an array with a single element that is the full string, whereas I want an array with 4 elements (each individual token as its own element). I tested this with array_contains, which returns true only when I search for the entire string and false when I search for an individual token:
from pyspark.sql.functions import array_contains
df3.withColumn("test", array_contains("TOKEN_ARRAY", "[Cool, to, wear, .]")).show()
+---------------+---------------------+-----------------------+-------+
| SENTENCE | TOKEN | TOKEN_ARRAY | test |
+---------------+---------------------+-----------------------+-------+
| Cool to wear.| [Cool, to, wear, .]| [[Cool, to, wear, .]] | true |
+---------------+---------------------+-----------------------+-------+
df3.withColumn("test", array_contains("TOKEN_ARRAY", "Cool")).show()
+---------------+---------------------+-----------------------+-------+
| SENTENCE | TOKEN | TOKEN_ARRAY | test |
+---------------+---------------------+-----------------------+-------+
| Cool to wear.| [Cool, to, wear, .]| [[Cool, to, wear, .]] | false |
+---------------+---------------------+-----------------------+-------+
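For comparison, the built-in split function produces exactly the kind of array column I am after (just a sketch, with whitespace splitting as a crude stand-in for spaCy's tokenizer; df_demo and the regex are my own illustration):

from pyspark.sql.functions import split, array_contains

# split returns a real array<string> column, so array_contains
# matches individual elements
df_demo = df.withColumn("TOKEN_SPLIT", split("SENTENCE", r"\s+"))
df_demo.withColumn("test", array_contains("TOKEN_SPLIT", "Cool")).show()

How can I get the spaCy tokenizer UDF to return a proper array column like this, with each individual token as an element?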