
I have a data frame df that reads a JSON file as follows:

df = spark.read.json("/myfiles/file1.json")

df.dtypes shows the following columns and data types:

id - string
Name - struct
address - struct
Phone - struct
start_date - string
years_with_company - int
highest_education - string
department - string
reporting_hierarchy - struct

I want to extract only non-struct columns and create a data frame. For example, my resulting data frame should only have id, start_date, highest_education, and department.

Here is the code I have, which only partially works: because df1 is reassigned on every loop iteration, only the last non-struct column (department) ends up populated. I want all non-struct columns collected and converted into a single data frame:

from pyspark.sql.types import StructType

names = df.schema.names

for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s" % col_name)
    else:
        df1 = df.select(col_name).collect()

I'm pretty sure this is not the best way to do it, and I'm missing something I can't put my finger on, so I would appreciate your help. Thank you.


1 Answer


Use a list comprehension:

cols_filtered = [
    c for c in df.schema.names
    if not isinstance(df.schema[c].dataType, StructType)
]

Or, filter on df.dtypes instead:

# Thank you @pault for the suggestion!
cols_filtered = [c for c, t in df.dtypes if t != 'struct']

Now, you can pass the result to df.select.

df2 = df.select(*cols_filtered)
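For a quick sanity check of the dtypes-based filter without a running Spark session, here is a plain-Python sketch over hypothetical (name, type) pairs mirroring the question's schema. One caveat, also raised in the comments: df.dtypes spells struct columns out as 'struct<...>' with their fields included, so a prefix check can be safer than strict equality with 'struct'.

```python
# Plain-Python sanity check of the filter; no Spark session needed.
# The (name, type) pairs below are hypothetical, mirroring the question's
# schema. df.dtypes renders struct columns as 'struct<...>', so we use a
# prefix check rather than t != 'struct'.
dtypes = [
    ("id", "string"),
    ("Name", "struct<first:string,last:string>"),    # hypothetical fields
    ("address", "struct<city:string,zip:string>"),   # hypothetical fields
    ("start_date", "string"),
    ("years_with_company", "int"),
    ("department", "string"),
]

cols_filtered = [c for c, t in dtypes if not t.startswith("struct")]
print(cols_filtered)  # ['id', 'start_date', 'years_with_company', 'department']
```

The same list would then be passed to df.select(*cols_filtered) on the real DataFrame.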

5 Comments

Thank you very much! I used a NOT expression because isinstance alone gave me only the struct columns. The following worked for me: cols_filtered = [c for c in df.schema.names if not isinstance(df.schema[c].dataType, StructType)] followed by df1 = df.select(*cols_filtered). Thank you for your help.
@Sameer I apologise for the sloppy answer. Thanks for the fix!
or you can do cols_filtered = [c for c, t in df.dtypes if t != 'struct'] and avoid the call to isinstance
@pault - Interestingly, the original solution works for me and the new one does not. Maybe 'struct' is not a valid type string in dtypes, even though that is how Spark represents a StructType internally in the schema. Comparing the data type string against 'struct' does not give me the desired results, but checking isinstance of StructType works. spark.apache.org/docs/1.5.2/api/java/org/apache/spark/sql/types/…. Thank you for your help.
@pault sorry for the delay. df.dtypes does show struct when printing the data types of the schema.
