I have data with the following format:

customer_id  model
1            [{color: 'red', group: 'A'},{color: 'green', group: 'B'}]
2            [{color: 'red', group: 'A'}]

I need to process it so that I create a new dataframe with the following output:

customer_id  color  group
1            red    A
1            green  B
2            red    A

I can do this easily with pandas in Python:

import json

import pandas as pd

# Collect plain dicts and build the dataframe once at the end;
# DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
rows = []

for _, row in df.iterrows():
    # Parse the JSON string into a list of dicts
    items = json.loads(row['model'])

    # Emit one output row per element of the list
    for item in items:
        rows.append({
            'customer_id': row['customer_id'],
            'color': item['color'],
            'group': item['group'],
        })

newdf = pd.DataFrame(rows, columns=['customer_id', 'color', 'group'])

How can I achieve the same result with pyspark?

Here are the first rows and the schema of the original dataframe:

+-----------+--------------------+
|customer_id|              model |
+-----------+--------------------+
|       3541|[{"score":0.04767...|
|     171811|[{"score":0.04473...|
|      12008|[{"score":0.08043...|
|      78964|[{"score":0.06669...|
|     119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- user_id: integer (nullable = true)
 |-- groups: string (nullable = true)

1 Answer

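from_json can parse a string column that contains JSON data. After parsing, explode produces one row per array element, and the individual keys can then be selected as columns: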

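from pyspark.sql import functions as F
from pyspark.sql import types as T

# Assumes an existing SparkSession named `spark` (as in the pyspark shell)
data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
        [2, "[{color: 'red', group: 'A'}]"]]

# Each JSON string is parsed as an array of string-to-string maps
model_type = T.ArrayType(T.MapType(T.StringType(), T.StringType()))

df = (
    spark.createDataFrame(data, schema=["customer_id", "model"])
    # allowUnquotedFieldNames is needed because the sample keys are not quoted
    .withColumn("model", F.from_json("model", model_type, {"allowUnquotedFieldNames": True}))
    # One output row per array element
    .withColumn("model", F.explode("model"))
    # Pull the two keys out of each map into their own columns
    .withColumn("color", F.col("model")["color"])
    .withColumn("group", F.col("model")["group"])
    .drop("model")
)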

Result:

+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
|          1|  red|    A|
|          1|green|    B|
|          2|  red|    A|
+-----------+-----+-----+

6 Comments

Thanks for this. How can I get the list of lists in data? The dataframe has 5 million rows, and when I try to concatenate the two columns into that format so I can process them as you suggest, I get the error: Can not infer schema for type: <type 'unicode'>
@Sapehi my answer assumes that all JSON strings in the field model of the input dataset have the same schema: an array of color/group combinations. If there are other JSON strings in the column, maybe you could post an example?
I think it's because I have the original data in a spark dataframe and I don't know how to have those two columns as a list of lists, like you have in your variable 'data'. When I try I get [Row(customer_id=7286, groups=u'[{"color":'red'.....], and that format gives me error later.
@Sapehi the code should work fine with a Spark dataframe. Could you please include the output of originalData.show() and originalData.printSchema() in your question? Maybe this
Thank you. I have added that info to see if it helps :)