
I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON and now I am trying to access these dicts and create columns with the values in them. This is what the dicts look like:

[screenshot of the exploded dicts; each dict has the keys major, name, rank and value]

The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema: [screenshot of the printed schema]
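
For context, the explode step was roughly the following (a sketch; the original array column name "keywords" is an assumption):

from pyspark.sql import functions as F

# "keywords" is assumed to be an array<struct<...>> column in the raw dataframe;
# explode() produces one row per array element
dataframe = raw_df.withColumn("keywords_exp", F.explode(F.col("keywords")))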

However, when I try:

dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)

it throws an error:

PySpark: TypeError: col should be Column

There is no such problem with any of the other keys in the dict, e.g. "value".

I really do not understand the problem. Do I have to assume that there are inconsistencies in the data? If so, can you recommend a way to check for, or even work around, them?


Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:

import json
from pyspark.sql.types import StructType

schemapath = 'default_schema.json'
with open(schemapath) as f:
  d = json.load(f)
  schemaNew = StructType.fromJson(d)
  responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)

however, line

schemaNew = StructType.fromJson(d)

throws the following error:

KeyError: 'fields'

No idea where this 'fields' is coming from...

  • You can check how the field names are being stored using dataframe.select(col('keywords_exp.*')).columns. This can help you understand if there is an inconsistency in the field names. Commented Aug 4, 2022 at 9:53
  • Thanks for your hint! I printed the dataframe after applying your select statement and this is the result: ['major', 'name', 'rank', 'value']. When I try to show the dataframe, it tells me "AttributeError: 'list' object has no attribute 'show'" Commented Aug 4, 2022 at 9:59
  • We don't see any inconsistency in the column names... Commented Aug 4, 2022 at 10:06
  • The version seems to be OK. One other way you can access these fields is: select("keywords_exp.name", "keywords_exp.value") Commented Aug 4, 2022 at 10:23
  • The problem is not there; the schema is good as it was at the beginning. OP just needs to access struct fields like they're supposed to be accessed. I've created an example and posted it in the answer. Commented Aug 5, 2022 at 15:38

3 Answers


Errors in Spark tell the truth.

dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)

TypeError: col should be Column

The DataFrame.withColumn documentation tells you what its input parameters are called and what their data types are:

  • Parameters:
    - colName: str
    string, name of the new column.
    - col: Column
    a Column expression for the new column.

So, col is the parameter's name and Column is its type. Column is the data type which withColumn expects to receive as the parameter named col. What did it actually receive? It received dataframe.keywords_exp.name. But what data type is that?

print(type(dataframe.keywords_exp.name))
# <class 'method'>

As can be seen, it's not of the expected type Column...

To get a Column from a Struct's field, you must use a different syntax.

Note: data types in the dataframe are not what you think they are. You don't have dicts anymore. Instead, you have a Struct type column. The keys from the old dictionaries are now field names of the Struct type column.

To access struct fields, use either of the following options:

df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df = dataframe.withColumn("keywords_name", dataframe.keywords_exp['name'])

(Both F.col("keywords_exp.name") and dataframe.keywords_exp['name'] are of type Column.)
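
You can verify this quickly (a small check, reusing the question's dataframe and assuming from pyspark.sql import functions as F):

print(type(F.col("keywords_exp.name")))       # <class 'pyspark.sql.column.Column'>
print(type(dataframe.keywords_exp['name']))   # <class 'pyspark.sql.column.Column'>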

Here is a dataframe with the same schema as yours. You can see that withColumn works well:

from pyspark.sql import functions as F
dataframe = spark.createDataFrame(
    [(("N", "glocations", 1, "Cuba"),)],
    'keywords_exp struct<major:string,name:string,rank:bigint,value:string>')
dataframe.printSchema()
# root
#  |-- keywords_exp: struct (nullable = true)
#  |    |-- major: string (nullable = true)
#  |    |-- name: string (nullable = true)
#  |    |-- rank: long (nullable = true)
#  |    |-- value: string (nullable = true)

df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))

df.show()
# +--------------------+-------------+
# |        keywords_exp|keywords_name|
# +--------------------+-------------+
# |{N, glocations, 1...|   glocations|
# +--------------------+-------------+

1 Comment

That worked, thanks a lot! Though I still don't understand why I don't need the col function when accessing rank or value, but only for name.
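
(A likely explanation, sketched with the dataframe from the answer above: name collides with the existing Column.name method, an alias for Column.alias, while rank and value don't shadow any Column attribute, so plain attribute access still falls through to the struct fields:)

# "name" is shadowed by the Column.name method (an alias for Column.alias),
# so attribute access returns the bound method instead of the struct field
print(dataframe.keywords_exp.name)    # <bound method Column.name of ...>

# "value" and "rank" don't collide with any Column attribute, so attribute
# access falls through to struct-field access and yields a Column
print(dataframe.keywords_exp.value)   # a Column (struct-field access)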

Try setting the schema before reading. Edit: I think the JSON schema needs to be in a specific format. I know it's not documented very well, but you can extract an example using the .json() method to see the format and then adjust your schema files. See the updated example below:

aa.json

[{"keyword_exp": {"name": "aa", "value": "bb"}}, {"keyword_exp": {"name": "oo", "value": "ee"}}]

test.py

from pyspark.sql.session import SparkSession
import json


if __name__ == '__main__':
    spark = SparkSession.builder.appName("test-app").master("local[1]").getOrCreate()

    from pyspark.sql.types import StructType, StructField, StringType
    schema = StructType([
        StructField('keyword_exp', StructType([
            StructField('name', StringType(), False),
            StructField('value', StringType(), False),
        ])),
    ])

    json_str = schema.json()
    json_obj = json.loads(json_str)
    # Save output of this as file
    print(json_str)
    # Just to see it pretty
    print(json.dumps(json_obj, indent=4))
    # Save to file
    with open("file_schema.json", "w") as f:
        f.write(json_str)
    # Load
    with open("file_schema.json", "r") as f:
        scheme_obj = json.loads(f.read())
    # Re-load
    loaded_schema = StructType.fromJson(scheme_obj)

    df = spark.read.json("./aa.json", schema=schema)
    df.printSchema()
    df = df.select("keyword_exp.name", "keyword_exp.value")
    df.show()


output:

{"fields":[{"metadata":{},"name":"keyword_exp","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":false,"type":"string"},{"metadata":{},"name":"value","nullable":false,"type":"string"}],"type":"struct"}}],"type":"struct"}


{
    "fields": [
        {
            "metadata": {},
            "name": "keyword_exp",
            "nullable": true,
            "type": {
                "fields": [
                    {
                        "metadata": {},
                        "name": "name",
                        "nullable": false,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "value",
                        "nullable": false,
                        "type": "string"
                    }
                ],
                "type": "struct"
            }
        }
    ],
    "type": "struct"
}
root
 |-- keyword_exp: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- value: string (nullable = true)

+----+-----+
|name|value|
+----+-----+
|  aa|   bb|
|  oo|   ee|
+----+-----+

1 Comment

Hi Khalid, thanks for the input. I updated my original post with some more insights
-1

The Spark API seems to have problems with certain protected words. I came across this link when googling the error message

AttributeError: ‘function’ object has no attribute

https://learn.microsoft.com/en-us/azure/databricks/kb/python/function-object-no-attribute

Though "name" is not on the list, I changed all "name"-occurences in the JSON to "nameabcde" and now I can access it:

[screenshot showing the renamed field being accessed successfully]
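
If editing the raw JSON is impractical, a similar workaround can be sketched in Spark itself by rebuilding the struct with the field renamed (the replacement name keyword_name is just an example):

from pyspark.sql import functions as F

# Rebuild the struct so the field formerly called "name" no longer
# collides with the Column.name method under attribute access
df = dataframe.withColumn(
    "keywords_exp",
    F.struct(
        F.col("keywords_exp.major").alias("major"),
        F.col("keywords_exp.name").alias("keyword_name"),
        F.col("keywords_exp.rank").alias("rank"),
        F.col("keywords_exp.value").alias("value"),
    ),
)
df.select(df.keywords_exp.keyword_name).show()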

