PySpark error: TypeError: Invalid argument, not a string or column

Question

I am trying to calculate the similarity between all possible pairs of words from a column in a Spark data frame. I have created a UDF, as well as a data frame to test the function, defined as follows:

# Imports (MetricLCS is assumed here to come from the strsimpy package)
from pyspark.sql.functions import udf
from strsimpy.metric_lcs import MetricLCS

# Similarity function
def lcs_similarity(vector):
  metric_lcs = MetricLCS()
  p = []
  for i in vector:
    for j in vector:
      p.append(1 - metric_lcs.distance(i, j))
  return p

# UDF
lcs_similarityUDF = udf(lambda z: lcs_similarity(z))

# Spark data frame
df = spark.createDataFrame(["GERMAN", "GERMANIA", "GERMANY", "LENOVO"], "string").toDF("Name")

I am expecting a column of 16 rows, since there are 16 possible pairs. However, once I test the function

lcs_similarityUDF(df.select("Name"))

I am given the error:


TypeError: Invalid argument, not a string or column: DataFrame[Name: string] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I've been trying to fix this problem through different approaches, but I can't make it work, and I know very little about Spark or what the problem could be. I don't know if I've made a mistake in the UDF or in defining the data frame, so any help with this is greatly appreciated.

ggordon · Accepted Answer · 2021-04-06 03:26:23Z

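A UDF must be called with a column (or a column name), not with a DataFrame, which is what the error message is telling you. Apply your UDF inside select as follows: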

df.select(
    lcs_similarityUDF(df.Name).alias("Name")
)
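
Note that a UDF applied this way is called once per row, so it sees a single name at a time rather than the whole column. If the goal is one row per pair (16 rows for four names), one possible approach is to cross join the column with itself and score each pair with a two-argument UDF. The following is only a sketch under assumptions: lcs_length, pair_similarity, and all_pairs_similarity are hypothetical helper names, and pair_similarity is a pure-Python stand-in that I assume matches 1 - MetricLCS().distance(a, b) (LCS length divided by the longer string's length).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pair_similarity(a, b):
    """Assumed stand-in for 1 - MetricLCS().distance(a, b)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return lcs_length(a, b) / longest


def all_pairs_similarity(df, column="Name"):
    """Cross join the column with itself and score every pair of values."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    pair_similarity_udf = F.udf(pair_similarity, DoubleType())
    left = df.select(F.col(column).alias("left"))
    right = df.select(F.col(column).alias("right"))
    return left.crossJoin(right).withColumn(
        "similarity", pair_similarity_udf(F.col("left"), F.col("right"))
    )
```

With the data frame from the question, all_pairs_similarity(df).show() should then list every (left, right) combination with its score. A cross join grows quadratically with the number of rows, so this is only reasonable for small columns.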
