I don't even know the best way to phrase this question in a title.
I have the following dataset:
from pyspark.sql import functions as f

df = spark.createDataFrame([
    (["1", "2", "3", "4"], ),
    (["1", "2", "3"], ),
    (["2", "1", "3"], ),
    (["2", "3", "4", "1"], ),
    (["6", "7"], ),
], ['cycle'])
df.show()
+------------+
| cycle|
+------------+
|[1, 2, 3, 4]|
| [1, 2, 3]|
| [2, 1, 3]|
|[2, 3, 4, 1]|
| [6, 7]|
+------------+
What I would like to have at the end is:
- remove the permutations
- keep only the maximal rows, i.e. drop every row whose set is fully contained in another row's set
I can use sort_array() and distinct() to get rid of the permutations
df.select(f.sort_array("cycle").alias("cycle")).distinct().show()
+------------+
| cycle|
+------------+
|[1, 2, 3, 4]|
| [6, 7]|
| [1, 2, 3]|
+------------+
What I would like to reduce the dataset to with PySpark is:
+------------+
| cycle|
+------------+
|[1, 2, 3, 4]|
| [6, 7]|
+------------+
So I need to somehow check that [1, 2, 3] is a subset of [1, 2, 3, 4] and keep only the larger one.
Essentially I want the Python set operation A.issubset(B), applied the PySpark/Spark way over a column.
The only way I can currently think of is a horribly slow iterative loop over every row, which would kill all performance.
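For illustration, this is roughly the brute-force version I am trying to avoid: a sketch that collects everything to the driver and compares every pair of rows with Python's subset test (variable names are just for this example):

# brute-force sketch of what I mean: pull everything to the driver,
# build Python sets, and keep only the sets not strictly contained in another
deduped = df.select(f.sort_array("cycle").alias("cycle")).distinct().collect()
sets = [set(row["cycle"]) for row in deduped]

# s < other is Python's proper-subset test (A.issubset(B) and A != B)
maximal = [s for s in sets if not any(s < other for other in sets)]
# -> keeps {1, 2, 3, 4} and {6, 7}, drops {1, 2, 3}

This is O(n²) on the driver and doesn't use Spark at all, so I'm looking for a way to express the same subset logic on the DataFrame itself.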