I have data with the following format:
| customer_id | model |
|---|---|
| 1 | [{color: 'red', group: 'A'},{color: 'green', group: 'B'}] |
| 2 | [{color: 'red', group: 'A'}] |
I need to process it so that I create a new dataframe with the following output:
| customer_id | color | group |
|---|---|---|
| 1 | red | A |
| 1 | green | B |
| 2 | red | A |
Now I can do this easily in Python:

```python
import json

import pandas as pd

frames = []
for index, row in df.iterrows():
    x = json.loads(row['model'])
    colors_list = []
    users_list = []
    groups_list = []
    for item in x:
        colors_list.append(item['color'])
        users_list.append(row['customer_id'])
        groups_list.append(item['group'])
    frames.append(pd.DataFrame({'customer_id': users_list,
                                'color': colors_list,
                                'group': groups_list}))
newdf = pd.concat(frames, ignore_index=True)
```
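The loop can also be avoided entirely in pandas (a sketch on the simplified example above, using `json.loads`, `DataFrame.explode`, and `pd.json_normalize`):

```python
import json

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2],
    "model": [
        '[{"color": "red", "group": "A"}, {"color": "green", "group": "B"}]',
        '[{"color": "red", "group": "A"}]',
    ],
})

# Parse each JSON string into a list of dicts, then explode to one row per element.
parsed = (df.assign(model=df["model"].map(json.loads))
            .explode("model")
            .reset_index(drop=True))

# Expand the per-row dicts into their own color/group columns.
newdf = pd.concat(
    [parsed["customer_id"], pd.json_normalize(parsed["model"].tolist())],
    axis=1,
)
```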
How can I achieve the same result with pyspark?
Here are the first rows and the schema of the original dataframe:
```text
+-----------+--------------------+
|customer_id|               model|
+-----------+--------------------+
|       3541|[{"score":0.04767...|
|     171811|[{"score":0.04473...|
|      12008|[{"score":0.08043...|
|      78964|[{"score":0.06669...|
|     119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- user_id: integer (nullable = true)
 |-- groups: string (nullable = true)
```