
Assume the table below is a PySpark DataFrame, and I want to filter the column ind on multiple values. How can I do this in PySpark?

ind group people value 
John  1    5    100   
Ram   1    2    2       
John  1    10   80    
Tom   2    20   40    
Tom   1    7    10    
Anil  2    23   30    

I am trying the following, but without success:

filter = ['John', 'Ram']
filtered_df = df.filter("ind == filter ")
filtered_df.show()

How can I achieve this in Spark?


2 Answers


You can use the built-in Column method isin (note the closing parenthesis, which was missing):

filtered_df = df.filter(df["ind"].isin(["John", "Ram"]))




You can use:

filtered_df = df.filter("ind in ('John', 'Ram')")
filtered_df.show()

Or

filter = ['John', 'Ram']
processed_for_pyspark = ', '.join("'" + s + "'" for s in filter)
filtered_df = df.filter("ind in ({})".format(processed_for_pyspark))
filtered_df.show()

if you want to keep your filter values in a list. Also note that in these SQL expression strings equality is tested with a single equals sign = (as in SQL), not the double == used in Python.

