PySpark cannot match array

Question

I'm using PySpark to do simple dataframe filtering. The Spark dataframe df_rules looks like this:

I got this df_rules in this way:

from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local")\
   .appName("Association Rules FP-Growth")\
   .config("spark.some.config.option", "some-value")\
   .getOrCreate()

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

# Display frequent itemsets.
model.freqItemsets.show()

# Display generated association rules.
df_rules = model.associationRules

I just want to do df_rules.where(df_rules.consequent == [1]). It first gave me data type mismatch error, since df_rules.consequent is array<bigint>. So I converted consequent column data type through:

from pyspark.sql.types import ArrayType, IntegerType
df_rules = df_rules.withColumn("consequent", df_rules.consequent.cast(ArrayType(IntegerType())))

But still got error:

Do you know how can I do filtering successfully?

Sergey Khudyakov · Accepted Answer · 2018-05-28 10:04:27Z

1

You don't have to convert array<bigint> to array<int>, just use long:

from pyspark.sql.functions import array, lit

df_rules.where(df_rules.consequent == array(lit(1L)))

answered May 28, 2018 at 10:04

Sergey Khudyakov

1,1821 gold badge8 silver badges15 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

PySpark cannot match array

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related