
I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON and now I am trying to access these dicts and create columns with the values in them. This is what the dicts look like:

[screenshot of the exploded dicts; each dict has the keys major, name, rank and value]

The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema: [screenshot of the printed schema]
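
For context, the explode step was roughly the following (a sketch; the original array column name "keywords" is an assumption):

from pyspark.sql import functions as F

# "keywords" is assumed to be an array<struct<...>> column in the raw dataframe;
# explode() produces one row per array element
dataframe = raw_df.withColumn("keywords_exp", F.explode(F.col("keywords")))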

However, when I try:

dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)

it throws an error:

PySpark: TypeError: col should be Column

There is no such problem with any of the other keys in the dict, e.g. "value".

I really do not understand the problem. Do I have to assume that there are inconsistencies in the data? If so, can you recommend a way to check for, or even work around, them?


Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:

import json
from pyspark.sql.types import StructType

schemapath = 'default_schema.json'
with open(schemapath) as f:
  d = json.load(f)
  schemaNew = StructType.fromJson(d)
  responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)

however, line

schemaNew = StructType.fromJson(d)

throws the following error:

KeyError: 'fields'

No idea where this 'fields' is coming from...

  • You can check how the field names are being stored using dataframe.select(col('keywords_exp.*')).columns. This can help you understand if there is an inconsistency in the field names. Commented Aug 4, 2022 at 9:53
  • Thanks for your hint! I printed the dataframe after applying your select statement and this is the result: ['major', 'name', 'rank', 'value']. When I try to show the dataframe, it tells me "AttributeError: 'list' object has no attribute 'show'" Commented Aug 4, 2022 at 9:59
  • We don't see any inconsistency in the column names... Commented Aug 4, 2022 at 10:06
  • The version seems to be OK. One other way you can access these fields is: select("keywords_exp.name", "keywords_exp.value") Commented Aug 4, 2022 at 10:23
  • The problem is not there; the schema is good as it was at the beginning. OP just needs to access struct fields like they're supposed to be accessed. I've created an example and posted it in the answer. Commented Aug 5, 2022 at 15:38

3 Answers


Errors in Spark tell the truth.

dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)

TypeError: col should be Column

The DataFrame.withColumn documentation tells you what its input parameters are called and what their data types are:

  • Parameters:
    - colName: str
    string, name of the new column.
    - col: Column
    a Column expression for the new column.

So, col is the parameter's name and Column is its type. Column is the data type which withColumn expects to receive as the parameter named col. What did it actually receive? It received dataframe.keywords_exp.name. But what data type is that?

print(type(dataframe.keywords_exp.name))
# <class 'method'>

As can be seen, it's not of the expected type Column...

To get a Column from a Struct's field, you must use a different syntax.

Note: data types in the dataframe are not what you think they are. You don't have dicts anymore. Instead, you have a Struct type column. The keys from the old dictionaries are now field names of the Struct type column.

To access struct fields, use either of the following options:

df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df = dataframe.withColumn("keywords_name", dataframe.keywords_exp['name'])

(Both F.col("keywords_exp.name") and dataframe.keywords_exp['name'] are of type Column.)
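
You can verify this quickly (a small check, reusing the question's dataframe and assuming from pyspark.sql import functions as F):

print(type(F.col("keywords_exp.name")))       # <class 'pyspark.sql.column.Column'>
print(type(dataframe.keywords_exp['name']))   # <class 'pyspark.sql.column.Column'>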

Here is a dataframe with the same schema as yours. You can see that withColumn works well:

from pyspark.sql import functions as F
dataframe = spark.createDataFrame(
    [(("N", "glocations", 1, "Cuba"),)],
    'keywords_exp struct<major:string,name:string,rank:bigint,value:string>')
dataframe.printSchema()
# root
#  |-- keywords_exp: struct (nullable = true)
#  |    |-- major: string (nullable = true)
#  |    |-- name: string (nullable = true)
#  |    |-- rank: long (nullable = true)
#  |    |-- value: string (nullable = true)

df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))

df.show()
# +--------------------+-------------+
# |        keywords_exp|keywords_name|
# +--------------------+-------------+
# |{N, glocations, 1...|   glocations|
# +--------------------+-------------+

1 Comment

That worked, thanks a lot! Though I still don't understand why I don't need the col function when accessing rank or value, but only for name.
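
(A likely explanation, sketched with the dataframe from the answer above: name collides with the existing Column.name method, an alias for Column.alias, while rank and value don't shadow any Column attribute, so plain attribute access still falls through to the struct fields:)

# "name" is shadowed by the Column.name method (an alias for Column.alias),
# so attribute access returns the bound method instead of the struct field
print(dataframe.keywords_exp.name)    # <bound method Column.name of ...>

# "value" and "rank" don't collide with any Column attribute, so attribute
# access falls through to struct-field access and yields a Column
print(dataframe.keywords_exp.value)   # a Column (struct-field access)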

Try setting the schema before reading. Edit: I think the JSON schema needs to be in a specific format. I know it's not documented very well, but you can extract an example using the .json() method to see the format and then adjust your schema files. See the updated example below:

aa.json

[{"keyword_exp": {"name": "aa", "value": "bb"}}, {"keyword_exp": {"name": "oo", "value": "ee"}}]

test.py

from pyspark.sql.session import SparkSession
import json


if __name__ == '__main__':
    spark = SparkSession.builder.appName("test-app").master("local[1]").getOrCreate()

    from pyspark.sql.types import StructType, StructField, StringType
    schema = StructType([
        StructField('keyword_exp', StructType([
            StructField('name', StringType(), False),
            StructField('value', StringType(), False),
        ])),
    ])

    json_str = schema.json()
    json_obj = json.loads(json_str)
    # Save output of this as file
    print(json_str)
    # Just to see it pretty
    print(json.dumps(json_obj, indent=4))
    # Save to file
    with open("file_schema.json", "w") as f:
        f.write(json_str)
    # Load
    with open("file_schema.json", "r") as f:
        scheme_obj = json.loads(f.read())
    # Re-load
    loaded_schema = StructType.fromJson(scheme_obj)

    df = spark.read.json("./aa.json", schema=schema)
    df.printSchema()
    df = df.select("keyword_exp.name", "keyword_exp.value")
    df.show()


output:

{"fields":[{"metadata":{},"name":"keyword_exp","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":false,"type":"string"},{"metadata":{},"name":"value","nullable":false,"type":"string"}],"type":"struct"}}],"type":"struct"}


{
    "fields": [
        {
            "metadata": {},
            "name": "keyword_exp",
            "nullable": true,
            "type": {
                "fields": [
                    {
                        "metadata": {},
                        "name": "name",
                        "nullable": false,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "value",
                        "nullable": false,
                        "type": "string"
                    }
                ],
                "type": "struct"
            }
        }
    ],
    "type": "struct"
}
root
 |-- keyword_exp: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- value: string (nullable = true)

+----+-----+
|name|value|
+----+-----+
|  aa|   bb|
|  oo|   ee|
+----+-----+

1 Comment

Hi Khalid, thanks for the input. I updated my original post with some more insights
-1

The Spark API seems to have problems with certain protected words. I came across this link when googling the error message

AttributeError: ‘function’ object has no attribute

https://learn.microsoft.com/en-us/azure/databricks/kb/python/function-object-no-attribute

Though "name" is not on the list, I changed all "name"-occurences in the JSON to "nameabcde" and now I can access it:

[screenshot showing the renamed field being accessed successfully]
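
If editing the raw JSON is impractical, a similar workaround can be sketched in Spark itself by rebuilding the struct with the field renamed (the replacement name keyword_name is just an example):

from pyspark.sql import functions as F

# Rebuild the struct so the field formerly called "name" no longer
# collides with the Column.name method under attribute access
df = dataframe.withColumn(
    "keywords_exp",
    F.struct(
        F.col("keywords_exp.major").alias("major"),
        F.col("keywords_exp.name").alias("keyword_name"),
        F.col("keywords_exp.rank").alias("rank"),
        F.col("keywords_exp.value").alias("value"),
    ),
)
df.select(df.keywords_exp.keyword_name).show()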

