1

I have a column with following structure in my dataframe.

+--------------------+
|                data|
+--------------------+
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
+--------------------+
only showing top 5 rows

The data inside column is a json string. I want to convert the column to some other type (map, struct..). How do I do this with a udf function? I have created a function like this but cant figure out what the return type should be. I tried StructType and MapType which threw error. This is my code.

import json
from pyspark.sql.types import MapType, StructType

udf_getDict = F.udf(lambda x: json.loads(x), StructType)

subset.select(udf_getDict(F.col('data'))).printSchema()

1 Answer 1

2

You can use an approach with spark.read.json and df.rdd.map such as this:

json_string = """
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
"""
df2 = spark.createDataFrame(
    [
        (1, json_string), 
    ],
    ['id', 'txt'] 
)
df2.dtypes
[('id', 'bigint'), ('txt', 'string')]


new_df = spark.read.json(df2.rdd.map(lambda r: r.txt))
new_df.printSchema()
root
 |-- glossary: struct (nullable = true)
 |    |-- GlossDiv: struct (nullable = true)
 |    |    |-- GlossList: struct (nullable = true)
 |    |    |    |-- GlossEntry: struct (nullable = true)
 |    |    |    |    |-- Abbrev: string (nullable = true)
 |    |    |    |    |-- Acronym: string (nullable = true)
 |    |    |    |    |-- GlossDef: struct (nullable = true)
 |    |    |    |    |    |-- GlossSeeAlso: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |-- para: string (nullable = true)
 |    |    |    |    |-- GlossSee: string (nullable = true)
 |    |    |    |    |-- GlossTerm: string (nullable = true)
 |    |    |    |    |-- ID: string (nullable = true)
 |    |    |    |    |-- SortAs: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |-- title: string (nullable = true)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.