Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and value_1, value_2 that contain a integer value. For example, with k = 3:
d1 =
+------------+------------+
| elements_1 | elements_2 |
+-------------------------+
| (1, 4, 3) | (3, 4, 5) |
| (2, 1, 3) | (1, 0, 2) |
| (4, 3, 1) | (3, 5, 6) |
+-------------------------+
I need to create a new column combinations made that contains, for each pair of sets elements_1 and elements_2, a list of the sets from all possible combinations of their elements. These sets must have the following properties:
- Their size must be
k+1 - They must contain either the set in
elements_1or the set inelements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1) and (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because it is not of length 3+1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.