
I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?

df.drop(['col1','col2'])
TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261 

TypeError: col should be a string or a Column

4 Answers


Since PySpark 2.1.0, the drop method supports multiple columns:

PySpark 2.0.2:

DataFrame.drop(col)

PySpark 2.1.0:

DataFrame.drop(*cols)

Example:

df.drop('col1', 'col2')

or using the * operator as

df.drop(*['col1', 'col2'])

Comments

Just to be clear, in case it isn't obvious to some folks landing here, when @Patrick writes DataFrame.drop(*cols) above, cols is a Python list, and putting the star before it converts it into positional arguments.
That is so unintuitive and unpythonic, but thanks for the answer.
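The unpacking behavior described in the comment above can be sketched in plain Python (no Spark needed): prefixing a list with * passes its elements as separate positional arguments, so drop(*cols) is equivalent to spelling the names out.

```python
# Minimal sketch of * unpacking, independent of Spark.
def drop(*cols):
    # *cols collects all positional arguments into a tuple
    return cols

cols = ['col1', 'col2']

# Both calls receive the same positional arguments:
assert drop(*cols) == drop('col1', 'col2') == ('col1', 'col2')
```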

Simply with select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])

or if you really want to use drop then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
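To see how reduce threads the DataFrame through successive drop calls, here is a plain-Python sketch (not Spark): a dict of columns stands in for the DataFrame, and a hypothetical drop(frame, col) plays the role of DataFrame.drop. Each reduce step feeds the accumulated result and one column name into drop, exactly as in the one-liner above.

```python
from functools import reduce

def drop(frame, col):
    # return a new "frame" without the given column, mimicking DataFrame.drop
    return {k: v for k, v in frame.items() if k != col}

# hypothetical stand-in for a DataFrame with three columns
df = {'GpuName': [], 'GPU1_TwoPartHwID': [], 'MachineName': []}

result = reduce(drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
# only the undropped column remains
```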

Note (difference in execution time):

There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.

There is a difference however when we analyze driver-side code:

  • the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded
  • the first method generates a logical plan which is equivalent to the physical plan; in the second case it is rewritten
  • finally, comprehensions are significantly faster in Python than methods like map or reduce
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
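The third point above concerns only driver-side work: the select approach filters the column names with a single list comprehension, rather than one call per dropped column. A small sketch (column names here are hypothetical, matching the question's examples):

```python
# Driver-side name filtering as used by the select-based approach:
# one pass over the column names, with a set for O(1) membership tests.
columns = ['GpuName', 'GPU1_TwoPartHwID', 'MachineName', 'OS']
to_drop = {'GpuName', 'GPU1_TwoPartHwID'}

kept = [c for c in columns if c not in to_drop]
# kept holds the columns that would be passed to df.select(...)
```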



The right way to do this is:

df.drop(*['col1', 'col2', 'col3'])

The * needs to come outside of the brackets if there are multiple columns to drop.

Comments

This doesn't add any new information to this post. The * unpacking is shown in this answer, with further explanation of the syntax in this comment.
The answer you point to does not work for me: df.drop('col1', 'col2') is incorrect, the columns have to be in brackets and the * needs to be outside the bracket. That's why I posted.
If it's not working for you, your error is somewhere else because the df.drop(*['col1', 'col2']) is syntactically equivalent to df.drop('col1', 'col2')
@pault you're right. For some reason, your method didn't work for me earlier but now it does. In any case, the * is necessary if you do decide to use brackets, so I think it's fair to keep the answer here as a potential alternative solution. Thanks.
@Ceren: How do you make these changes happen in the DataFrame itself? Like inplace=True in pandas, where the change is reflected in the original DataFrame. As noted, df.drop(*cols) returns a new DataFrame.

In case none of the above works for you, try this:

from pyspark.sql.functions import col

df.drop(col("col1")).drop(col("col2"))

My Spark version is 3.1.2.

