
I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?

df.drop(['col1','col2'])
TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261 

TypeError: col should be a string or a Column

4 Answers


Since PySpark 2.1.0, the drop method supports multiple columns:

PySpark 2.0.2:

DataFrame.drop(col)

PySpark 2.1.0:

DataFrame.drop(*cols)

Example:

df.drop('col1', 'col2')

or using the * operator as

df.drop(*['col1', 'col2'])

Comments

Just to be clear, in case it isn't obvious to some folks landing here, when @Patrick writes DataFrame.drop(*cols) above, cols is a Python list, and putting the star before it converts it into positional arguments.
That is so unintuitive and unpythonic, but thanks for the answer.
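The unpacking behavior described in the comment above can be sketched in plain Python (no Spark needed): prefixing a list with * passes its elements as separate positional arguments, so drop(*cols) is equivalent to spelling the names out.

```python
# Minimal sketch of * unpacking, independent of Spark.
def drop(*cols):
    # *cols collects all positional arguments into a tuple
    return cols

cols = ['col1', 'col2']

# Both calls receive the same positional arguments:
assert drop(*cols) == drop('col1', 'col2') == ('col1', 'col2')
```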

Simply with select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])

or if you really want to use drop then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
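To see how reduce threads the DataFrame through successive drop calls, here is a plain-Python sketch (not Spark): a dict of columns stands in for the DataFrame, and a hypothetical drop(frame, col) plays the role of DataFrame.drop. Each reduce step feeds the accumulated result and one column name into drop, exactly as in the one-liner above.

```python
from functools import reduce

def drop(frame, col):
    # return a new "frame" without the given column, mimicking DataFrame.drop
    return {k: v for k, v in frame.items() if k != col}

# hypothetical stand-in for a DataFrame with three columns
df = {'GpuName': [], 'GPU1_TwoPartHwID': [], 'MachineName': []}

result = reduce(drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
# only the undropped column remains
```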

Note (difference in execution time):

There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.

There is a difference however when we analyze driver-side code:

  • the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded
  • the first method generates a logical plan which is equivalent to the physical plan; in the second case it is rewritten
  • finally, comprehensions are significantly faster in Python than methods like map or reduce
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
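The third point above concerns only driver-side work: the select approach filters the column names with a single list comprehension, rather than one call per dropped column. A small sketch (column names here are hypothetical, matching the question's examples):

```python
# Driver-side name filtering as used by the select-based approach:
# one pass over the column names, with a set for O(1) membership tests.
columns = ['GpuName', 'GPU1_TwoPartHwID', 'MachineName', 'OS']
to_drop = {'GpuName', 'GPU1_TwoPartHwID'}

kept = [c for c in columns if c not in to_drop]
# kept holds the columns that would be passed to df.select(...)
```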



The right way to do this is:

df.drop(*['col1', 'col2', 'col3'])

The * needs to come outside of the brackets if there are multiple columns to drop.

Comments

This doesn't add any new information to this post. The * unpacking is shown in this answer, with further explanation of the syntax in this comment.
The answer you point to does not work for me: df.drop('col1', 'col2') is incorrect, the columns have to be in brackets and the * needs to be outside the bracket. That's why I posted.
If it's not working for you, your error is somewhere else because the df.drop(*['col1', 'col2']) is syntactically equivalent to df.drop('col1', 'col2')
@pault you're right. For some reason, your method didn't work for me earlier but now it does. In any case, the * is necessary if you do decide to use brackets, so I think it's fair to keep the answer here as a potential alternative solution. Thanks.
@Ceren: How do you make these changes happen in the DataFrame itself? Like inplace=True in pandas, where the change is reflected in the original DataFrame. As noted, df.drop(*cols) returns a new DataFrame.

In case none of the above works for you, try this:

from pyspark.sql.functions import col

df.drop(col("col1")).drop(col("col2"))

My Spark version is 3.1.2.

