
I have a data frame df that reads a JSON file as follows:

df = spark.read.json("/myfiles/file1.json")

df.dtypes shows the following columns and data types:

id - string
Name - struct
address - struct
Phone - struct
start_date - string
years_with_company - int
highest_education - string
department - string
reporting_hierarchy - struct

I want to extract only non-struct columns and create a data frame. For example, my resulting data frame should only have id, start_date, highest_education, and department.

Here is the code I have, which only partially works: because df1 is reassigned on every loop iteration, only the last non-struct column (department) ends up populated. I want all non-struct columns collected and converted into a single data frame:

from pyspark.sql.types import StructType

names = df.schema.names

for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s" % col_name)
    else:
        df1 = df.select(col_name).collect()

I'm pretty sure this is not the best way to do it, and I'm missing something I can't put my finger on, so I would appreciate your help. Thank you.


1 Answer


Use a list comprehension:

cols_filtered = [
    c for c in df.schema.names
    if not isinstance(df.schema[c].dataType, StructType)
]

Or, filter on df.dtypes instead:

# Thank you @pault for the suggestion!
cols_filtered = [c for c, t in df.dtypes if t != 'struct']

Now, you can pass the result to df.select.

df2 = df.select(*cols_filtered)
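For a quick sanity check of the dtypes-based filter without a running Spark session, here is a plain-Python sketch over hypothetical (name, type) pairs mirroring the question's schema. One caveat, also raised in the comments: df.dtypes spells struct columns out as 'struct<...>' with their fields included, so a prefix check can be safer than strict equality with 'struct'.

```python
# Plain-Python sanity check of the filter; no Spark session needed.
# The (name, type) pairs below are hypothetical, mirroring the question's
# schema. df.dtypes renders struct columns as 'struct<...>', so we use a
# prefix check rather than t != 'struct'.
dtypes = [
    ("id", "string"),
    ("Name", "struct<first:string,last:string>"),    # hypothetical fields
    ("address", "struct<city:string,zip:string>"),   # hypothetical fields
    ("start_date", "string"),
    ("years_with_company", "int"),
    ("department", "string"),
]

cols_filtered = [c for c, t in dtypes if not t.startswith("struct")]
print(cols_filtered)  # ['id', 'start_date', 'years_with_company', 'department']
```

The same list would then be passed to df.select(*cols_filtered) on the real DataFrame.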

5 Comments

Thank you very much! I used a NOT expression because isinstance alone gave me only the struct columns. The following worked for me: cols_filtered = [c for c in df.schema.names if not isinstance(df.schema[c].dataType, StructType)] followed by df1 = df.select(*cols_filtered). Thank you for your help.
@Sameer I apologise for the sloppy answer. Thanks for the fix!
or you can do cols_filtered = [c for c, t in df.dtypes if t != 'struct'] and avoid the call to isinstance
@pault - Interestingly, the original solution works for me and the new one does not. Maybe 'struct' is not a valid type string in dtypes, even though that is how Spark represents a StructType internally in the schema. Comparing the data type string against 'struct' does not give me the desired results, but checking isinstance of StructType works. spark.apache.org/docs/1.5.2/api/java/org/apache/spark/sql/types/…. Thank you for your help.
@pault sorry for the delay. df.dtypes does show struct when printing the data types of the schema.
