I am new to Spark and am trying to do the following using PySpark:
I have a dataframe with 3 columns, "id", "number1", "number2".
Each value of "id" has multiple rows, and for each id I want to build a list of (number1, number2) tuples from the rows that belong to it.
E.g., for the following dataframe
id | number1 | number2 |
a | 1 | 1 |
a | 2 | 2 |
b | 3 | 3 |
b | 4 | 4 |
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
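To make the target concrete, here is a minimal plain-Python sketch of the grouping I'm after, run over the example rows above. It is not Spark code: the `rows` list and the `grouped` name are made up for illustration, and what I'm looking for is the PySpark equivalent.

```python
from collections import defaultdict

# Example rows copied from the table above: (id, number1, number2)
rows = [("a", 1, 1), ("a", 2, 2), ("b", 3, 3), ("b", 4, 4)]

# Collect the (number1, number2) pairs under their id
grouped = defaultdict(list)
for id_, number1, number2 in rows:
    grouped[id_].append((number1, number2))

print(grouped["a"])  # [(1, 1), (2, 2)]
print(grouped["b"])  # [(3, 3), (4, 4)]
```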
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids with the following:
distinct_ids = [row.id for row in df.select('id').distinct().collect()]  # collect() returns Row objects, so pull out the id field
In pandas, which I'm more familiar with, I would now loop over the distinct ids and gather all the rows for each one, but I'm sure this is far from optimal in Spark.
Can you give me any ideas? groupBy comes to mind, but I'm not sure how to use it here.