How to automatically drop constant columns in pyspark?

Question

I have a Spark DataFrame in PySpark and I need to drop all constant columns from it. Since I don't know in advance which columns are constant, I cannot deselect them manually, so I need an automatic procedure. I was surprised not to find a simple solution on Stack Overflow.

Example:

import pandas as pd
import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()

d = {'col1': [1, 2, 3, 4, 5], 
     'col2': [1, 2, 3, 4, 5],
     'col3': [0, 0, 0, 0, 0],
     'col4': [0, 0, 0, 0, 0]}
df_panda = pd.DataFrame(data=d)
df_spark = spark.createDataFrame(df_panda)
df_spark.show()

Output:

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   1|   0|   0|
|   2|   2|   0|   0|
|   3|   3|   0|   0|
|   4|   4|   0|   0|
|   5|   5|   0|   0|
+----+----+----+----+

Desired output:

+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
|   4|   4|
|   5|   5|
+----+----+

What is the best way to automatically drop constant columns in pyspark?

akuiper · Accepted Answer · 2019-04-21 20:04:22Z

Count distinct values in each column first and then drop columns that contain only one distinct value:

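import pyspark.sql.functions as f

# One aggregation pass: number of distinct non-null values per column.
cnt = df_spark.agg(*(f.countDistinct(c).alias(c) for c in df_spark.columns)).first()
cnt
# Row(col1=5, col2=5, col3=1, col4=1)

# Drop every column that has exactly one distinct value.
df_spark.drop(*[c for c in cnt.asDict() if cnt[c] == 1]).show()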
+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
|   4|   4|
|   5|   5|
+----+----+
