Pyspark dataframe with json, iteration to create new dataframe

Question

I have data with the following format:

customer_id	model
1	[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]
2	[{color: 'red', group: 'A'}]

I need to process it so that I create a new dataframe with the following output:

customer_id	color	group
1	red	A
1	green	B
2	red	A

I can do this easily in Python with pandas:

import json
import pandas as pd

rows = []

for index, row in df.iterrows():
    # Each 'model' value is expected to be a valid JSON array of objects
    for item in json.loads(row['model']):
        rows.append({'customer_id': row['customer_id'],
                     'group': item['group'],
                     'color': item['color']})

# Build the result once instead of appending to a DataFrame inside the loop
# (DataFrame.append was removed in pandas 2.0)
newdf = pd.DataFrame(rows, columns=['customer_id', 'group', 'color'])

How can I achieve the same result with pyspark?

Here are the first rows and the schema of the original dataframe:

+-----------+--------------------+
|customer_id|              model |
+-----------+--------------------+
|       3541|[{"score":0.04767...|
|     171811|[{"score":0.04473...|
|      12008|[{"score":0.08043...|
|      78964|[{"score":0.06669...|
|     119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- user_id: integer (nullable = true)
 |-- groups: string (nullable = true)

werner · Accepted Answer · 2021-04-28 20:42:30Z


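from_json can parse a string column that contains JSON data. The allowUnquotedFieldNames option is passed here because the field names in the sample strings are not quoted: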

from pyspark.sql import functions as F
from pyspark.sql import types as T

data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
        [2, "[{color: 'red', group: 'A'}]"]]

df = spark.createDataFrame(data, schema = ["customer_id", "model"]) \
    .withColumn("model", F.from_json("model", T.ArrayType(T.MapType(T.StringType(), T.StringType())), {"allowUnquotedFieldNames": True})) \
    .withColumn("model", F.explode("model")) \
    .withColumn("color", F.col("model")["color"]) \
    .withColumn("group", F.col("model")["group"]) \
    .drop("model")

Result:

+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
|          1|  red|    A|
|          1|green|    B|
|          2|  red|    A|
+-----------+-----+-----+


Comments

Sapehi Over a year ago

Thanks for this. How can I build the list of lists in data? The dataframe has 5 million rows, and when I try to get the two columns concatenated in that format so I can process them as you suggest, I get the error: Can not infer schema for type: <type 'unicode'>

werner Over a year ago

@Sapehi my answer assumes that all json strings in the field model of the input dataset have the same schema: an array of color/group combinations. If there are other json strings in the column, maybe you could post an example?

Sapehi Over a year ago

I think it's because I have the original data in a Spark dataframe and I don't know how to get those two columns as a list of lists, like you have in your variable 'data'. When I try, I get [Row(customer_id=7286, groups=u'[{"color":'red'.....], and that format gives me an error later.

werner Over a year ago

@Sapehi the code should work fine with a Spark dataframe. Could you please include the output of originalData.show() and originalData.printSchema() in your question? Maybe this

Sapehi Over a year ago

Thank you. I have added that info to see if it helps :)
