I have a dataframe like this
data = [(("ID1", ['October', 'September', 'August'])), (("ID2", ['August', 'June', 'May'])),
(("ID3", ['October', 'June']))]
df = spark.createDataFrame(data, ["ID", "MonthList"])
df.show(truncate=False)
+---+----------------------------+
|ID |MonthList |
+---+----------------------------+
|ID1|[October, September, August]|
|ID2|[August, June, May] |
|ID3|[October, June] |
+---+----------------------------+
I want to compare every row with a default list, such that if the value is present assign 1 else 0
default_month_list = ['October', 'September', 'August', 'July', 'June', 'May']
Hence my expected output is this
+---+----------------------------+------------------+
|ID |MonthList |Binary_MonthList |
+---+----------------------------+------------------+
|ID1|[October, September, August]|[1, 1, 1, 0, 0, 0]|
|ID2|[August, June, May] |[0, 0, 1, 0, 1, 1]|
|ID3|[October, June] |[1, 0, 0, 0, 1, 0]|
+---+----------------------------+------------------+
I am able to do this in python, but don't know how to do this in pyspark