
Using Spark 2.1.1

Below is my data frame

id  Name1     Name2
1   Naveen    Srikanth
2   Naveen    Srikanth123
3   Naveen    null
4   Srikanth  Naveen
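
For reference, a minimal PySpark snippet to reproduce this sample data frame (a sketch assuming an active SparkSession named spark) would be:

data = [
    (1, "Naveen", "Srikanth"),
    (2, "Naveen", "Srikanth123"),
    (3, "Naveen", None),          # Name2 is null for row 3
    (4, "Srikanth", "Naveen"),
]
df = spark.createDataFrame(data, ["id", "Name1", "Name2"])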

Now I need to filter rows based on two conditions: rows 2 and 3 should be filtered out, because row 2's name contains the digits 123 and row 3 has a null value.

I am using the code below, which handles only row id 2:

df.select("*").filter(df["Name2"].rlike("[0-9]")).show()

I got stuck trying to include the second condition.

2 Answers


Doing the following should solve your issue:

from pyspark.sql.functions import col
df.filter((~col("Name2").rlike("[0-9]")) | (col("Name2").isNotNull()))

5 Comments

Is the syntax the same for PySpark? I am using PySpark, not Scala.
I used it in PySpark as from pyspark.sql.functions import * and the snippet you gave, Ramesh, is not working. Also, the code you have given works only for 123, but I need it for any digit matching [0-9].
It should have been an && as in my example.
Thank you Ramesh. What you told me was right, but I got the answer with code similar to yours: df.select("*").filter(~df["Name2"].rlike("[0-9]"))
Helpful to see the import col statement...other answers did not include this!
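
Putting the answer and the comments together, a corrected PySpark sketch (assuming the sample data above) would combine the two conditions with & and negate the regex match with ~, so that only the clean rows remain:

from pyspark.sql.functions import col

# keep rows whose Name2 is present and contains no digits
df.filter(col("Name2").isNotNull() & ~col("Name2").rlike("[0-9]")).show()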

It should be as simple as putting multiple conditions into the filter.

import spark.sqlContext.implicits._  // needed for toDF and the $ column syntax

val df = List(
  ("Naveen", "Srikanth"),
  ("Naveen", "Srikanth123"),
  ("Naveen", null),
  ("Srikanth", "Naveen")).toDF("Name1", "Name2")

df.filter(!$"Name2".isNull && !$"Name2".rlike("[0-9]")).show

Or, if you prefer not to use the spark-sql $ syntax:

df.filter(!df("Name2").isNull && !df("Name2").rlike("[0-9]")).show 

or in Python:

df.filter(df["Name2"].isNotNull() & ~df["Name2"].rlike("[0-9]")).show()

4 Comments

I am getting spark.sqlContext.implicits._ not found, Michel, and && and !$ are giving invalid operator errors; it won't let me use them.
This import is for '$' and works like a charm in the Scala REPL, as I just tested. When you are in a full-featured Spark project, you can import it in a scope that has access to the Spark session variable (org.apache.spark.sql.SparkSession).
Yeah, it is not working because I am using PySpark and that syntax is Scala. I am not using Scala; my environment is a Cloudera project environment.
Maybe you shouldn't have tagged the post [scala] then... I assumed you were familiar with both environments.
