Multiple Filter of Dataframe on Spark with Scala

Question

I am trying to filter this text file:

TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car

I simply want to check every row and drop it if any of its columns holds a null value.

In my sample dataset, four rows contain a null value.

However, I am getting an empty DataFrame when I run the code. Am I missing something?

This is my code in Scala:

import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"

    val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true")).csv(filePath)

    df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show
  }
}

The project is in IntelliJ.

Thanks a lot.

Naveen Nelamali · Accepted Answer · 2018-12-26 18:08:33Z

2

You can do this in multiple ways. Below is one of them.

import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile2 {

  def main(args:Array[String]):Unit= {

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath="src/main/resources/demodata.txt"

    val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true")).csv(filePath)

    val df2 = df.select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")
    df2.show()

  }
}

Output:

+------+-------------------+---------+-------------+-------------------+
|Gender|          BirthDate|TotalCost|TotalChildren|ProductCategoryName|
+------+-------------------+---------+-------------+-------------------+
|Female|1957-03-06 00:00:00|     5000|            3|             Beauty|
|  Male|1959-03-06 00:00:00|     6000|            4|                Car|
+------+-------------------+---------+-------------+-------------------+

Thanks, Naveen


1 Comment

Naveen Nelamali Over a year ago

Alternatively, you can also try this:

    df.select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")
      .where(df("Gender").isNotNull && df("BirthDate").isNotNull && df("TotalChildren").isNotNull && df("ProductCategoryName").isNotNull)
      .show()
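Since the goal is to drop a row whenever any of its columns is null, Spark's `DataFrame.na.drop()` may be a shorter route than chaining one filter per column. The following is only a minimal sketch: the object name `DropRowsWithNulls` and the helper `sampleData` are made up for illustration, and the rows are built in memory (with `None` standing in for an empty field) rather than read from the question's file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropRowsWithNulls {

  // Hypothetical helper: rows modelled on the question's sample data.
  // None stands for a field that is empty in the file.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rows: Seq[(Int, Option[String], Option[String], Option[Int], Option[String])] = Seq(
      (1000, None, Some("Male"), Some(2), Some("Technology")),
      (2000, Some("1957-03-06"), None, Some(3), Some("Beauty")),
      (3000, Some("1959-03-06"), Some("Male"), None, Some("Car")),
      (4000, Some("1953-03-06"), Some("Male"), Some(2), None),
      (5000, Some("1957-03-06"), Some("Female"), Some(3), Some("Beauty")),
      (6000, Some("1959-03-06"), Some("Male"), Some(4), Some("Car"))
    )
    rows.toDF("TotalCost", "BirthDate", "Gender", "TotalChildren", "ProductCategoryName")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("DropRowsWithNulls")
      .getOrCreate()

    val df = sampleData(spark)

    // Drop every row that has a null in any column.
    df.na.drop().show()

    // Drop only the rows that have a null in one of the listed columns.
    df.na.drop(Seq("Gender", "TotalChildren")).show()

    spark.stop()
  }
}
```

With this sample data, the first call should leave only the 5000 and 6000 rows, which matches the output shown in the answer above.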

Sc0rpion · Accepted Answer · 2018-12-26 17:38:12Z

0

You can filter the DataFrame directly, as below:

    df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show


2 Comments

giorgionasis Over a year ago

Cannot resolve overloaded method 'where'. Should I import something?

Sc0rpion Over a year ago

No. Some typo in your code. Please update your modified code.
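A likely cause of the "Cannot resolve overloaded method 'where'" error is that the `$"..."` column syntax only exists once `spark.implicits._` has been imported from the active `SparkSession`; the code in the question never imports it. Here is a minimal sketch of the same filter with that import in place. The object name, the helper `keepKnownGenderAndChildren`, and the three in-memory rows are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FilterWithDollarSyntax {

  // Hypothetical helper: keeps rows where both Gender and TotalChildren are non-null.
  def keepKnownGenderAndChildren(spark: SparkSession, df: DataFrame): DataFrame = {
    // Brings the $"..." column interpolator into scope.
    // Without this import, $"Gender" does not compile.
    import spark.implicits._
    df.where($"Gender".isNotNull && $"TotalChildren".isNotNull)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("FilterWithDollarSyntax")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory sample; None stands for an empty field.
    val rows: Seq[(Int, Option[String], Option[Int])] = Seq(
      (2000, None, Some(3)),
      (3000, Some("Male"), None),
      (5000, Some("Female"), Some(3))
    )
    val df = rows.toDF("TotalCost", "Gender", "TotalChildren")

    keepKnownGenderAndChildren(spark, df).show()

    spark.stop()
  }
}
```

If you prefer not to rely on the implicits, `org.apache.spark.sql.functions.col("Gender")` or `df("Gender")` build the same kind of column expression without the import.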
