
I am trying to use spaCy to tokenize a text and want to turn the tokens into an array column. Currently using:

from pyspark.sql.functions import udf
import spacy
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc]
tokenize = udf(spacy_tokenizer)

df2 = df.withColumn('TOKEN', tokenize('SENTENCE'))

from pyspark.sql.functions import array
df3 = df2.withColumn("TOKEN_ARRAY", array('TOKEN'))
df3.show()
+---------------+---------------------+-----------------------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           |
+---------------+---------------------+-----------------------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] |
+---------------+---------------------+-----------------------+

It is making an array with one element that is the full string, whereas I want an array with 4 elements (each individual token as an element). I tested this with array_contains, which only returns true when I search for the entire string and returns false when I search for an individual token.

from pyspark.sql.functions import array_contains
df3.withColumn("test", array_contains("TOKEN_ARRAY", "[Cool, to, wear, .]")).show()
+---------------+---------------------+-----------------------+-------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           | test  |
+---------------+---------------------+-----------------------+-------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] | true  |
+---------------+---------------------+-----------------------+-------+


df3.withColumn("test", array_contains("TOKEN_ARRAY", "Cool")).show()
+---------------+---------------------+-----------------------+-------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           | test  |
+---------------+---------------------+-----------------------+-------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] | false |
+---------------+---------------------+-----------------------+-------+
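For reference, the single-element behaviour comes from the UDF's default return type: without an explicit returnType, udf assumes StringType(), so the list returned by spacy_tokenizer is collapsed into one string before array('TOKEN') ever sees it. A rough plain-Python illustration of that collapse (the exact string Spark produces differs slightly in formatting):

```python
# The tokenizer returns a Python list...
tokens = ["Cool", "to", "wear", "."]

# ...but a udf declared (implicitly) as StringType must yield one string,
# so the whole list collapses into a single value, roughly like this:
as_string = str(tokens)
print(as_string)  # ['Cool', 'to', 'wear', '.']

# Wrapping that one string column with array() then gives a one-element array:
token_array = [as_string]
print(len(token_array))  # 1, not 4
```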

1 Answer


Needed to specify the array return type this way and then it worked:

from pyspark.sql.types import ArrayType, StringType

tokenize = udf(spacy_tokenizer, ArrayType(StringType()))

With the UDF returning array&lt;string&gt; directly, the extra array('TOKEN') wrapper is no longer needed.
