I'm using PySpark to do simple dataframe filtering. The Spark dataframe df_rules looks like this:

[screenshot: df_rules, the associationRules DataFrame, with columns antecedent, consequent, and confidence]

I created df_rules like this:

from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local")\
    .appName("Association Rules FP-Growth")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

# Display frequent itemsets.
model.freqItemsets.show()

# Display the generated association rules.
df_rules = model.associationRules
df_rules.show()

I just want to do df_rules.where(df_rules.consequent == [1]). It first gave me a data type mismatch error, since df_rules.consequent is of type array<bigint> while [1] is a plain Python list. So I converted the consequent column's data type with:

from pyspark.sql.types import ArrayType, IntegerType
df_rules = df_rules.withColumn("consequent", df_rules.consequent.cast(ArrayType(IntegerType())))
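
For reference, df_rules.printSchema() shows the element type of each array column. The output below is approximate and version-dependent (from a typical Spark 2.x run, before the cast; Spark 2.4+ adds a lift column):

df_rules.printSchema()
# root
#  |-- antecedent: array (nullable = true)
#  |    |-- element: long (containsNull = true)
#  |-- consequent: array (nullable = true)
#  |    |-- element: long (containsNull = true)
#  |-- confidence: double (nullable = true)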

But I still got an error:

[screenshot: error message from the failed comparison]

Do you know how I can do this filtering successfully?

1 Answer

You don't have to convert array<bigint> to array<int>; just compare against an array of longs. In Python 3 there is no 1L long literal, so cast the literal column instead:

from pyspark.sql.functions import array, lit

df_rules.where(df_rules.consequent == array(lit(1).cast("long")))
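
Equivalently (a sketch, assuming Spark 2.x or later), you can cast the whole literal array to the column's type, mirroring the ArrayType import already used in the question:

from pyspark.sql.functions import array, lit
from pyspark.sql.types import ArrayType, LongType

# Cast the literal array [1] to array<bigint> so it matches consequent's element type
df_rules.where(df_rules.consequent == array(lit(1)).cast(ArrayType(LongType()))).show()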