I am trying to calculate the similarity between all possible pairs of words from a column in a Spark data frame. I have created a UDF and a data frame to test the function, defined as follows:
#Similarity Function
def lcs_similarityy(vector):
    metric_lcs = MetricLCS()
    p = []
    for i in vector:
        for j in vector:
            p.append(1 - metric_lcs.distance(i, j))
    return p
#UDF
from pyspark.sql.functions import udf
lcs_similarityyUDF = udf(lambda z: lcs_similarityy(z))
#Spark Data Frame
df = spark.createDataFrame(["GERMAN", "GERMANIA", "GERMANY", "LENOVO"], "string").toDF("Name")
I am expecting a column of 16 rows, since four words give 4 x 4 = 16 possible ordered pairs (including each word paired with itself). However, once I test the function
lcs_similarityyUDF(df.select("Name"))
I am given the error:
TypeError: Invalid argument, not a string or column: DataFrame[Name: string] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I've tried several approaches to fix this but can't make it work, and I know very little about Spark, so I'm not sure what the problem could be. I don't know whether the mistake is in the UDF or in how I defined the data frame, so any help is greatly appreciated.
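To make the expected output concrete, here is a rough plain-Python sketch (no Spark involved) of the 16 rows I am after. It assumes that MetricLCS computes the distance as 1 - |LCS| / max(len(a), len(b)), which is my understanding of the library; the helpers lcs_length and lcs_similarity are hypothetical stand-ins written only for this illustration, not part of any library.

```python
from itertools import product


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for 1 - MetricLCS().distance(a, b) (assumed formula)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


names = ["GERMAN", "GERMANIA", "GERMANY", "LENOVO"]

# One row per ordered pair, including each word paired with itself: 4 * 4 = 16 rows.
pairs = [(a, b, lcs_similarity(a, b)) for a, b in product(names, repeat=2)]

for a, b, sim in pairs:
    print(f"{a:10}{b:10}{sim:.3f}")
```

What I would like from Spark is the same thing: one row per pair of names, with the similarity as an extra column.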