2

I have dataframe of n columns and I would like to count the number of missing values in each column.

I use the following snippet of code to perform this but the output isn't what I'm expecting:

for (e <- df.columns) {
    var c: Int = df.filter( df(e).isNull || df(e) === "" || df(e).isNaN || 
                            df(e) === "-" || df(e) === "NA").count()
    println(e+":"+c)
}

Output:

column1:
column2:
column3:

How to get the count of missing values correctly based on the logic stated in the snippet?

2 Answers 2

2

You can do it in a slightly different way.

  import org.apache.spark.sql.functions._

  val df = List[(Integer, Integer, Integer)]((1, null, null),(null, 2,3), (null, 3, null)).toDF("a", "b", "c")

  df.select(df.columns.map(c => count(predicate(col(c))).as(s"nulls in column $c")): _*).show()

  private def predicate(c: Column) = {
    c.isNull || c === "" || c.isNaN || c === "-" || c === "NA"
  }

This code will produce:

+-----------------+-----------------+-----------------+
|nulls in column a|nulls in column b|nulls in column c|
+-----------------+-----------------+-----------------+
|                2|                1|                2|
+-----------------+-----------------+-----------------+
Sign up to request clarification or add additional context in comments.

4 Comments

why did I get : error: not found: value predicate
I've done this : for (e <- df.columns) { println(e+":"+ df.filter(df(e).isNull || df(e) === "" || df(e).isNaN || df(e) === "-" || df_(e) === "NA").count()) }
@hisi check if you copied private def predicate(c: Column) function also
of course yes, but I don' know why he still gives me this error msg ?
0
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

var df=List[Tuple2[Integer,Integer]]((1,null),(null,2),(null,3)).toDF("name","rank")
df.show
+----+----+
|name|rank|
+----+----+
|   1|null|
|null|   2|
|null|   3|
+----+----+

val col=df.columns
var dfArray=col.map(colmn=>df.select(lit(colmn).as("colName"),sum(when(df(colmn).isNull || df(colmn)==="" || df(colmn)==="-" || df(colmn).isNaN,1).otherwise(0)).as("missingValues")))
dfArray.tail.foldLeft(dfArray.head)((acc,itr)=>acc.union(itr)).show
//output:
+-------+-------------+
|colName|missingValues|
+-------+-------------+
|   name|            2|
|   rank|            1|
+-------+-------------+

1 Comment

This piece of code is help you to get the missing values in dataFrame of each column

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.