
Using Spark 2.1.1

Below is my data frame

id  Name1     Name2
1   Naveen    Srikanth
2   Naveen    Srikanth123
3   Naveen    null
4   Srikanth  Naveen
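
For reference, a minimal PySpark snippet to reproduce this sample data frame (a sketch assuming an active SparkSession named spark) would be:

data = [
    (1, "Naveen", "Srikanth"),
    (2, "Naveen", "Srikanth123"),
    (3, "Naveen", None),          # Name2 is null for row 3
    (4, "Srikanth", "Naveen"),
]
df = spark.createDataFrame(data, ["id", "Name1", "Name2"])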

Now I need to filter rows based on two conditions: rows 2 and 3 should be filtered out, because row 2's name contains the digits 123 and row 3 has a null value.

I am using the code below, which handles only row id 2:

df.select("*").filter(df["Name2"].rlike("[0-9]")).show()

I got stuck trying to include the second condition.

2 Answers


Doing the following should solve your issue:

from pyspark.sql.functions import col
df.filter((~col("Name2").rlike("[0-9]")) | (col("Name2").isNotNull()))

5 Comments

Is the syntax the same for PySpark? I am using PySpark, not Scala.
I used it in PySpark as from pyspark.sql.functions import * and the snippet you gave, Ramesh, is not working. Also, the code you have given works only for 123, but I need it for any digit matching [0-9].
It should have been an && as in my example.
Thank you Ramesh. What you told me was right, but I got the answer with code similar to yours: df.select("*").filter(~df["Name2"].rlike("[0-9]"))
Helpful to see the import col statement...other answers did not include this!
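
Putting the answer and the comments together, a corrected PySpark sketch (assuming the sample data above) would combine the two conditions with & and negate the regex match with ~, so that only the clean rows remain:

from pyspark.sql.functions import col

# keep rows whose Name2 is present and contains no digits
df.filter(col("Name2").isNotNull() & ~col("Name2").rlike("[0-9]")).show()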

It should be as simple as putting multiple conditions into the filter.

import spark.sqlContext.implicits._  // needed for toDF and the $ column syntax

val df = List(
  ("Naveen", "Srikanth"),
  ("Naveen", "Srikanth123"),
  ("Naveen", null),
  ("Srikanth", "Naveen")).toDF("Name1", "Name2")

df.filter(!$"Name2".isNull && !$"Name2".rlike("[0-9]")).show

Or, if you prefer not to use the spark-sql $ syntax:

df.filter(!df("Name2").isNull && !df("Name2").rlike("[0-9]")).show 

or in Python:

df.filter(df["Name2"].isNotNull() & ~df["Name2"].rlike("[0-9]")).show()

4 Comments

I am getting spark.sqlContext.implicits._ not found, Michel, and && and !$ are giving invalid operator errors; it won't let me use them.
This import is for '$' and works like a charm in the Scala REPL, as I just tested. When you are in a full-featured Spark project, you can import it in a scope that has access to the Spark session variable (org.apache.spark.sql.SparkSession).
Yeah, it is not working because I am using PySpark and that syntax is Scala. I am not using Scala; my environment is a Cloudera project environment.
Maybe you shouldn't have tagged the post [scala] then... I assumed you were familiar with both environments.
