
How can I create a Spark DataFrame from a nested dictionary? I'm new to Spark, and I don't want to use a pandas DataFrame.

My dictionary looks like:

{'[email protected]': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 50)},
 '[email protected]': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 35)},
 '[email protected]': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 55)}
}

I want to convert this dict to a Spark DataFrame using PySpark.

My expected output:

user_name                    Date        idle_time
[email protected]   2019-10-21  2019-10-21 01:50:00
[email protected]   2019-10-21  2019-10-21 01:35:00
[email protected]             2019-10-21  2019-10-21 01:55:00

2 Answers


You need to reshape your dictionary and build Row objects so that Spark can properly infer the schema.

import datetime
from pyspark.sql import Row

data_dict = {
    '[email protected]': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 50)
    },
    '[email protected]': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 35)
    },
    '[email protected]': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 55)
    }
}

# Flatten each (email, inner dict) pair into a Row, promoting the key to user_name
data_as_rows = [Row(**{'user_name': k, **v}) for k, v in data_dict.items()]
data_df = spark.createDataFrame(data_as_rows).select('user_name', 'Date', 'idle_time')

data_df.show(truncate=False)

>>>
+-------------------------+----------+-------------------+
|user_name                |Date      |idle_time          |
+-------------------------+----------+-------------------+
|[email protected]|2019-10-21|2019-10-21 01:50:00|
|[email protected]|2019-10-21|2019-10-21 01:35:00|
|[email protected]          |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+

Note: if you already have the schema prepared and don't need to infer it, you can just supply the schema to the createDataFrame function:

import pyspark.sql.types as T

schema = T.StructType([
    T.StructField('user_name', T.StringType(), False),
    T.StructField('Date', T.DateType(), False),
    T.StructField('idle_time', T.TimestampType(), False)
])
# One tuple per record: (user_name, Date, idle_time)
data_as_tuples = [(k, v['Date'], v['idle_time']) for k, v in data_dict.items()]

data_df = spark.createDataFrame(data_as_tuples, schema=schema)

data_df.show(truncate=False)

>>>
+-------------------------+----------+-------------------+
|user_name                |Date      |idle_time          |
+-------------------------+----------+-------------------+
|[email protected]|2019-10-21|2019-10-21 01:50:00|
|[email protected]|2019-10-21|2019-10-21 01:35:00|
|[email protected]          |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+

1 Comment

Thanks. I want to know if there is any other method to solve this problem, like using spark.read.json, from_json, or get_json_object. @Mike Souder
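
For reference, a minimal sketch of the spark.read.json route asked about in the comment, assuming the records are first serialized to JSON strings (the date/datetime objects are not JSON-serializable, so default=str stringifies them and the columns are cast back afterwards):

import json

# Serialize each record to a JSON string; default=str turns the
# date/datetime objects into ISO-style strings.
json_rows = [json.dumps({'user_name': k, **v}, default=str)
             for k, v in data_dict.items()]

# Read the JSON strings from an RDD, then cast the string columns back.
df = spark.read.json(spark.sparkContext.parallelize(json_rows))
df = df.select('user_name',
               df['Date'].cast('date'),
               df['idle_time'].cast('timestamp'))
df.show(truncate=False)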

Convert the dictionary to a list of tuples; each tuple will then become a row in the Spark DataFrame:

# 'data' here refers to the original nested dictionary from the question
rows = []
for key, value in data.items():
    row = (key, value['Date'], value['idle_time'])
    rows.append(row)

Define schema for your data:

from pyspark.sql.types import *

sch = StructType([
    StructField('user_name', StringType()),
    StructField('date', DateType()),
    StructField('idle_time', TimestampType())
])

Create the Spark DataFrame:

df = spark.createDataFrame(rows, sch)

df.show()
+--------------------+----------+-------------------+
|           user_name|      date|          idle_time|
+--------------------+----------+-------------------+
|prathameshsalap@g...|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143@g...|2019-10-21|2019-10-21 01:35:00|
|     [email protected]|2019-10-21|2019-10-21 01:55:00|
+--------------------+----------+-------------------+

1 Comment

Thanks. I want to know if there is any other method to solve this problem, like using spark.read.json, from_json, or get_json_object. @David Vrba
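
For reference, a minimal sketch of the from_json variant mentioned in the comment, under the same assumption that the records are first serialized to JSON strings (timestampFormat is passed because the stringified datetimes use a space rather than a 'T'):

import json
import pyspark.sql.functions as F
from pyspark.sql.types import *

# One single-column DataFrame of JSON strings, one record per row.
json_df = spark.createDataFrame(
    [(json.dumps({'user_name': k, **v}, default=str),) for k, v in data.items()],
    ['json_str'])

# from_json matches fields by name, so the schema uses the original keys.
json_schema = StructType([
    StructField('user_name', StringType()),
    StructField('Date', DateType()),
    StructField('idle_time', TimestampType())
])

df = (json_df
      .select(F.from_json('json_str', json_schema,
                          {'timestampFormat': 'yyyy-MM-dd HH:mm:ss'}).alias('rec'))
      .select('rec.*'))
df.show()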
