I have a dataframe:
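import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df_example = pd.DataFrame({'user1': ['u1', 'u1', 'u1', 'u5', 'u5', 'u5', 'u7','u7','u6'],
                           'user2': ['u2', 'u3', 'u4', 'u2', 'u4','u6','u8','u3','u6']})
sdf = spark.createDataFrame(df_example)
# Collect every user2 value seen for each user1 into one array column
userreposts_gr = sdf.groupby('user1').agg(F.collect_list('user2').alias('all_user2'))
userreposts_gr.show()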
+-----+------------+
|user1|   all_user2|
+-----+------------+
|   u1|[u4, u2, u3]|
|   u7|    [u8, u3]|
|   u5|[u4, u2, u6]|
|   u6|        [u6]|
+-----+------------+
For each user1, I want to intersect its all_user2 list with the all_user2 list of every other user1, then add a new column that holds the user with the largest intersection and the size of that intersection. Expected output:
+-----+------------+------------------------------+
|user1|all_user2   |new_col                       |
+-----+------------+------------------------------+
|u1   |[u2, u3, u4]|{max_count -> 2, user -> 'u5'}|
|u5   |[u2, u4, u6]|{max_count -> 2, user -> 'u1'}|
|u7   |[u8, u3]    |{max_count -> 1, user -> 'u1'}|
|u6   |[u6]        |{max_count -> 1, user -> 'u5'}|
+-----+------------+------------------------------+
How can I build this column for each user1?
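To make the expected result concrete, here is a minimal plain-Python sketch of the logic I am after. It is not a Spark solution, only a reference for the semantics. The `grouped` dict and the `max_intersection` helper are hypothetical names introduced for illustration, and keeping the first user seen on a tie is an assumption, since the example data has no ties.

```python
# Plain-Python reference for the desired result (not a Spark solution).
# `grouped` mirrors the all_user2 lists from the example; the name is hypothetical.
grouped = {
    "u1": ["u2", "u3", "u4"],
    "u5": ["u2", "u4", "u6"],
    "u7": ["u8", "u3"],
    "u6": ["u6"],
}


def max_intersection(grouped):
    """For each user, find the other user whose list shares the most items."""
    result = {}
    for user, items in grouped.items():
        best = {"max_count": 0, "user": None}
        for other, other_items in grouped.items():
            if other == user:
                continue  # never compare a user with itself
            count = len(set(items) & set(other_items))
            # Strict '>' keeps the first user seen on a tie (assumed tie-break).
            if count > best["max_count"]:
                best = {"max_count": count, "user": other}
        result[user] = best
    return result


new_col = max_intersection(grouped)
for user, best in new_col.items():
    print(user, grouped[user], best)
```

On the example data this yields the same pairs as the expected table. It compares every pair of users, so it only serves to pin down the semantics on small inputs.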