
How can we change the datatype of a nested column in PySpark? For example, how can I change the data type of value from string to int?

Reference: how to change a Dataframe column from String type to Double type in pyspark

{
    "x": "12",
    "y": {
        "p": {
            "name": "abc",
            "value": "10"
        },
        "q": {
            "name": "pqr",
            "value": "20"
        }
    }
}
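The referenced question casts a top-level column with withColumn and cast; a minimal sketch (assuming the JSON above is saved as data.json) shows why that alone does not reach the nested value fields:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("data.json", multiLine=True)

# Top-level columns can be cast directly, as in the referenced question:
df = df.withColumn("x", col("x").cast("int"))

# But this does not modify nested fields: the line below would just add a new
# top-level column literally named "y.p.value" instead of changing y.p.value.
# df = df.withColumn("y.p.value", col("y.p.value").cast("int"))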
  • 1. Does this change need to be persistent, with changes saved to the json file? Or do you need the precision while you are performing an operation? Commented Aug 24, 2017 at 0:27
  • @diek Need it while writing to the json file Commented Aug 24, 2017 at 0:47

2 Answers


You can read the JSON data using

from pyspark.sql import SQLContext

# sc is an existing SparkContext (e.g. the one created by the pyspark shell)
sqlContext = SQLContext(sc)
data_df = sqlContext.read.json("data.json", multiLine=True)

data_df.printSchema()

output

root
 |-- x: long (nullable = true)
 |-- y: struct (nullable = true)
 |    |-- p: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: long (nullable = true)
 |    |-- q: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: long (nullable = true)

Now you can access the fields nested under the y column using dot notation (select returns a DataFrame, so call .show() to see the values):

data_df.select("y.p.name", "y.p.value").show()

output

+----+-----+
|name|value|
+----+-----+
| abc|   10|
+----+-----+
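If a flat integer result is enough, a cast inside the select works; this is a small sketch, not part of the original answer, and the p_value alias is illustrative. Note that it produces a new flat column rather than changing the type inside the nested struct, which is why the rest of the answer rebuilds the column:

from pyspark.sql.functions import col

# Cast the nested field on the way out; this does not alter the schema of y itself.
data_df.select(col("y.p.value").cast("int").alias("p_value")).show()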

OK, the solution is to add a new nested column with the correct schema and then drop the column with the wrong schema.

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType

# spark is an existing SparkSession (e.g. the one created by the pyspark shell)
df3 = spark.read.json("data.json", multiLine=True)

# create the correct schema from the old one
c = df3.schema['y'].jsonValue()
c['name'] = 'z'
c['type']['fields'][0]['type']['fields'][1]['type'] = 'long'
c['type']['fields'][1]['type']['fields'][1]['type'] = 'long'

y_schema = StructType.fromJson(c['type'])

# define a udf to populate the new column. Rows are immutable, so you
# have to build the new value from scratch.

def foo(row):
    d = row.asDict()
    y = {}
    y["p"] = {}
    y["p"]["name"] = d["p"]["name"]
    y["p"]["value"] = int(d["p"]["value"])
    y["q"] = {}
    y["q"]["name"] = d["q"]["name"]
    y["q"]["value"] = int(d["q"]["value"])

    return y

map_foo = udf(foo, y_schema)

# add the column
df3_new = df3.withColumn("z", map_foo("y"))

# delete the column
df4 = df3_new.drop("y")


df4.printSchema()

output

root
 |-- x: long (nullable = true)
 |-- z: struct (nullable = true)
 |    |-- p: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: long (nullable = true)
 |    |-- q: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: long (nullable = true)


df4.show()

output

+---+-------------------+
|  x|                  z|
+---+-------------------+
| 12|[[abc,10],[pqr,20]]|
+---+-------------------+
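For completeness, a UDF is not strictly required; a cast-only sketch (not part of the original answer, reusing the same df3) that rebuilds the nested column with struct and cast would look roughly like this:

from pyspark.sql.functions import struct, col

# Rebuild y with the value fields cast to long, replacing the old column in place.
df_cast = df3.withColumn(
    "y",
    struct(
        struct(
            col("y.p.name").alias("name"),
            col("y.p.value").cast("long").alias("value"),
        ).alias("p"),
        struct(
            col("y.q.name").alias("name"),
            col("y.q.value").cast("long").alias("value"),
        ).alias("q"),
    ),
)
df_cast.printSchema()

This keeps the original column name y and avoids serializing every row through a Python UDF.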

10 Comments

@aswinids I have edited the question. Any thoughts on this one?
@aswinids: Thanks for helping. Do we have decimal/timestamp data types in the JSON schema?
@aswinids: If I change the value of 10 to "10" and use type: 'long', I get null
@zero323 Do you have any idea ?
@J.D It's working completely fine with the above json_schema. Can you check it again? And yes, I'm reading the json file after converting the values to strings.
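On the decimal/timestamp question in the comments above: the JSON reader will not infer those types, but they can be declared in an explicit schema passed to the reader. A hedged sketch, with purely illustrative field names and file name:

from pyspark.sql.types import StructType, StructField, DecimalType, TimestampType, StringType

# Hypothetical schema: "amount", "created_at" and "other.json" are illustrative only.
explicit_schema = StructType([
    StructField("amount", DecimalType(10, 2), True),
    StructField("created_at", TimestampType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.schema(explicit_schema).json("other.json", multiLine=True)
df.printSchema()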

It may seem simple to use arbitrary variable names, but this is problematic and contrary to PEP 8. When dealing with numbers, I also suggest avoiding the common names used when iterating over such structures, i.e. value.

import json

with open('random.json') as json_file:
    data = json.load(json_file)

for k, v in data.items():
    if k == 'y':
        for key, item in v.items():
            item['value'] = float(item['value'])


print(type(data['y']['p']['value']))
print(type(data['y']['q']['value']))
# mac → python3 make_float.py
# <class 'float'>
# <class 'float'>
json_data = json.dumps(data, indent=4, sort_keys=True)
with open('random.json', 'w') as json_file:
    json_file.write(json_data)

The converted values are then written back out to random.json.

2 Comments

The crucial part of this problem is that we have around 60 GB of data produced every day and we need to ensure scalability; that's why Spark was the way to go.
Of course this would not be able to handle such a massive amount of data. Why did the question you referenced not work? From the documentation they give an example of dealing with this: ghostbin.com/paste/wt5y6
