I am trying to calculate the similarity between every possible pair of words in a column of a Spark data frame. I have created a UDF and a data frame to test the function, defined as follows:

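# Imports (MetricLCS is assumed here to come from the strsimpy package)
from pyspark.sql.functions import udf
from strsimpy.metric_lcs import MetricLCS

# Similarity function: similarity of every element of `vector` with every other
def lcs_similarity(vector):
    metric_lcs = MetricLCS()
    p = []
    for i in vector:
        for j in vector:
            p.append(1 - metric_lcs.distance(i, j))
    return p

# UDF
lcs_similarityUDF = udf(lambda z: lcs_similarity(z))

# Spark data frame
df = spark.createDataFrame(["GERMAN", "GERMANIA", "GERMANY", "LENOVO"], "string").toDF("Name")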

I am expecting a column of 16 rows, since there are 16 possible pairs. However, when I test the function

lcs_similarityUDF(df.select("Name"))

I am given the error:


TypeError: Invalid argument, not a string or column: DataFrame[Name: string] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I've tried several approaches to fix this, but I can't make it work, and I know very little about Spark or what the problem could be. I don't know whether the mistake is in the UDF or in how I defined the data frame, so any help is greatly appreciated.

1 Answer

A UDF takes a column, not a DataFrame, so apply it to df.Name inside a select:

df.select(
    lcs_similarityUDF(df.Name).alias("Name")
)
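
Note that a column UDF is called once per row, so `z` is a single name rather than the whole column, and the select above returns 4 rows, not 16. If the goal is one row per pair, one possible approach is to cross join the column with itself and score each pair. The following is only a sketch under assumptions: `lcs_length`, `lcs_similarity` and `pairwise_similarity` are hypothetical helper names, and `lcs_similarity` is a pure-Python stand-in that assumes MetricLCS uses the usual definition, distance = 1 - LCS length / length of the longer string.

```python
from itertools import product


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed stand-in for 1 - MetricLCS().distance(a, b)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def pairwise_similarity(df, col_name="Name"):
    """Return one row per ordered pair of values in `col_name`, with its similarity."""
    # Imported here so the helpers above stay usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    sim_udf = F.udf(lcs_similarity, DoubleType())
    left = df.select(F.col(col_name).alias("a"))
    right = df.select(F.col(col_name).alias("b"))
    # crossJoin produces every (a, b) combination: 4 names give 16 rows.
    return left.crossJoin(right).withColumn("similarity", sim_udf("a", "b"))


# Plain-Python check of the same pairing logic, no Spark session needed.
names = ["GERMAN", "GERMANIA", "GERMANY", "LENOVO"]
pairs = [(a, b, lcs_similarity(a, b)) for a, b in product(names, repeat=2)]
```

With the question's data frame, `pairwise_similarity(df).show()` should list the 16 pairs. A cross join grows quadratically with the number of rows, so this is only reasonable for small columns.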