I have a DataFrame df created by reading a JSON file:
df = spark.read.json("/myfiles/file1.json")
df.dtypes shows the following columns and data types:
id - string
Name - struct
address - struct
Phone - struct
start_date - string
years_with_company - int
highest_education - string
department - string
reporting_hierarchy - struct
I want to extract only the non-struct columns into a new data frame. For this schema, the resulting data frame should have only id, start_date, years_with_company, highest_education, and department.
Here is the code I have, which only partially works: df1 ends up holding just the last non-struct column, department. I want all the non-struct columns collected and then turned into a data frame:
from pyspark.sql.types import StructType

names = df.schema.names
for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s" % col_name)
    else:
        df1 = df.select(col_name).collect()
I'm pretty sure this isn't the best way to do it, and I'm missing something I can't put my finger on, so I would appreciate your help. Thank you.