I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON and now I am trying to access these dicts and create columns with the values in there. This is how the dicts look like:
The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema:

However, when I try:
dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
it throws error:
PySpark: TypeError: col should be Column
There is no such problem with any other of the keys in the dict, i.e. "value".
I really do not understand the problem, do I have to assume that there are inconsistencies in the data? If yes, can you recommend a way to check for or even dodge them?
Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:
schemapath = 'default_schema.json'
with open(schemapath) as f:
d = json.load(f)
schemaNew = StructType.fromJson(d)
responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)
however, line
schemaNew = StructType.fromJson(d)
throws following error:
KeyError: 'fields'
No idea, where this 'fields' is coming from...


dataframe.select(col('keywords_exp.*')).columns. this can help you understand if there is an inconsistency in the field names.