I am trying to filter this text file:

TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car

I simply want to go through every row and drop it if any column has a null element.

In my sample dataset, four rows contain a null value.

However, I am getting an empty DataFrame when I run the code. Am I missing something?

This is my code in Scala:

import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"

    val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true")).csv(filePath)

    df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show
  }
}

The project is in IntelliJ.

Thanks a lot.

2 Answers

2

You can do this in multiple ways. Below is one.

import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile2 {

  def main(args:Array[String]):Unit= {

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"

    val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true")).csv(filePath)

    val df2 = df.select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")
    df2.show()

  }
}

Output:

+------+-------------------+---------+-------------+-------------------+
|Gender|          BirthDate|TotalCost|TotalChildren|ProductCategoryName|
+------+-------------------+---------+-------------+-------------------+
|Female|1957-03-06 00:00:00|     5000|            3|             Beauty|
|  Male|1959-03-06 00:00:00|     6000|            4|                Car|
+------+-------------------+---------+-------------+-------------------+

Thanks, Naveen

1 Comment

Alternatively, you can also try this: df.select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName").where(df("Gender").isNotNull && df("BirthDate").isNotNull && df("TotalChildren").isNotNull && df("ProductCategoryName").isNotNull).show()
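
Since the goal is to drop a row when any column is null, another option is Spark's `DataFrame.na.drop()`, which avoids listing every column by hand. The following is a minimal sketch rather than the answerer's method; the object and helper names (`DropNullRowsSketch`, `dropRowsWithAnyNull`, `dropRowsWithNullIn`) are hypothetical, and the file path is taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropNullRowsSketch {

  // Drops every row that has a null in at least one column.
  def dropRowsWithAnyNull(df: DataFrame): DataFrame =
    df.na.drop()

  // Drops rows that have a null in any of the named columns only.
  def dropRowsWithNullIn(df: DataFrame, columns: Seq[String]): DataFrame =
    df.na.drop(columns)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    // Path assumed from the question; adjust to your project layout.
    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true"))
      .csv("src/main/resources/demodata.txt")

    // Any-null filter across all columns.
    dropRowsWithAnyNull(df).show()

    // Equivalent of the Gender/TotalChildren check from the question.
    dropRowsWithNullIn(df, Seq("Gender", "TotalChildren")).show()

    spark.stop()
  }
}
```

On the sample file, the first call should keep only the rows with no empty field, assuming Spark reads the empty fields as null (its default for CSV).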
0

You can just filter it from the DataFrame as below: df.where(!$"Gender".isNull && !$"TotalChildren".isNull).show

2 Comments

Cannot resolve overloaded method 'where'. Should I import something?
No. Some typo in your code. Please update your modified code.
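
On the "Cannot resolve overloaded method 'where'" error: a likely cause is that the `$"..."` column syntax is not in scope. It is provided by `import spark.implicits._`, which has to come after the `SparkSession` value is created. Below is a minimal sketch under that assumption; it also shows `col(...)` from `org.apache.spark.sql.functions`, which needs no implicits. The object and helper names (`FilterNotNullSketch`, `keepKnownGenderAndChildren`) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilterNotNullSketch {

  // Keeps rows where both Gender and TotalChildren are non-null.
  // Uses col(...), so it does not depend on spark.implicits._
  def keepKnownGenderAndChildren(df: DataFrame): DataFrame =
    df.where(col("Gender").isNotNull && col("TotalChildren").isNotNull)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExample")
      .getOrCreate()

    // Brings the $"columnName" string interpolator into scope.
    import spark.implicits._

    // Path assumed from the question; adjust to your project layout.
    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true"))
      .csv("src/main/resources/demodata.txt")

    // Same predicate as in the answer, written with the $ syntax.
    df.where($"Gender".isNotNull && $"TotalChildren".isNotNull).show()

    // Same filter through the helper that uses col(...).
    keepKnownGenderAndChildren(df).show()

    spark.stop()
  }
}
```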
