Spark Dataframe - Display empty row count for each column

Question

I have a DataFrame with n columns, and I would like to count the number of missing values in each column.

I use the following snippet to do this, but the output isn't what I expect:

for (e <- df.columns) {
    var c: Int = df.filter( df(e).isNull || df(e) === "" || df(e).isNaN || 
                            df(e) === "-" || df(e) === "NA").count()
    println(e+":"+c)
}

Output:

column1:
column2:
column3:

How can I get the correct count of missing values, based on the logic in the snippet?

addmeaning · Accepted Answer · 2018-08-07 12:23:52Z

2

You can do it in a slightly different way.

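  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions._

  // Define the helper before it is used; in spark-shell, paste it first as well.
  def predicate(c: Column): Column =
    c.isNull || c === "" || c.isNaN || c === "-" || c === "NA"

  val df = List[(Integer, Integer, Integer)]((1, null, null), (null, 2, 3), (null, 3, null)).toDF("a", "b", "c")

  // count ignores nulls, so when(...) without otherwise counts only the matching rows.
  df.select(df.columns.map(c => count(when(predicate(col(c)), 1)).as(s"nulls in column $c")): _*).show()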

This code will produce:

+-----------------+-----------------+-----------------+
|nulls in column a|nulls in column b|nulls in column c|
+-----------------+-----------------+-----------------+
|                2|                1|                2|
+-----------------+-----------------+-----------------+
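
One detail worth checking on string columns: `count(expr)` counts every row where `expr` is non-null, so a predicate that evaluates to `false` is still counted. The sketch below is only an illustration under assumptions: the object name `CountVsWhen`, the helper `isMissing`, the sample data, and the local session settings (including turning ANSI mode off so that failed casts yield null) are all hypothetical. It compares counting the predicate directly with counting `when(predicate, 1)`.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}

object CountVsWhen {
  // The question's rule for a "missing" value.
  def isMissing(c: Column): Column =
    c.isNull || c === "" || c.isNaN || c === "-" || c === "NA"

  // Returns (rows where the predicate is non-null, rows where it is true).
  def compare(df: DataFrame, name: String): (Long, Long) = {
    val row = df.select(
      count(isMissing(col(name))),          // also counts rows where the predicate is false
      count(when(isMissing(col(name)), 1))  // non-matching rows become null and are skipped
    ).first()
    (row.getLong(0), row.getLong(1))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session with ANSI mode disabled.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-vs-when")
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample: one ordinary value, one empty string, one null.
    val df = Seq[String]("x", "", null).toDF("a")
    println(compare(df, "a"))
    spark.stop()
  }
}
```

If the two numbers differ on your data, the `when` form is the one that matches the question's intent.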

answered Aug 7, 2018 at 12:23


4 Comments

HISI Over a year ago

Why did I get "error: not found: value predicate"?

HISI Over a year ago

I've done this:

for (e <- df.columns) { println(e + ":" + df.filter(df(e).isNull || df(e) === "" || df(e).isNaN || df(e) === "-" || df(e) === "NA").count()) }

addmeaning Over a year ago

@HISI Check whether you also copied the private def predicate(c: Column) function.

HISI Over a year ago

Yes, of course, but I don't know why it still gives me this error message.

Gagan Sp · Accepted Answer · 2018-08-07 15:35:31Z

0

import org.apache.spark.sql.functions._

val df = List[(Integer, Integer)]((1, null), (null, 2), (null, 3)).toDF("name", "rank")
df.show
+----+----+
|name|rank|
+----+----+
|   1|null|
|null|   2|
|null|   3|
+----+----+

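// Named columnNames rather than col, so it does not shadow functions.col.
val columnNames = df.columns
// One single-row DataFrame per column: the column name and its number of missing values.
val dfArray = columnNames.map(colmn => df.select(lit(colmn).as("colName"), sum(when(df(colmn).isNull || df(colmn) === "" || df(colmn) === "-" || df(colmn) === "NA" || df(colmn).isNaN, 1).otherwise(0)).as("missingValues")))
// Stack the per-column results into one DataFrame.
dfArray.reduce(_ union _).show
// Output: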
+-------+-------------+
|colName|missingValues|
+-------+-------------+
|   name|            2|
|   rank|            1|
+-------+-------------+
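
The union-based approach above builds one aggregation per column, which may scan the data once per column. A possible variation, sketched here under assumptions, computes all counts in a single `select` and then prints them in the `column:count` form the question uses. The names `MissingPerColumn`, `isMissing`, and `missingCounts` are hypothetical, and the local session with ANSI mode disabled is an assumption made so that failed casts yield null rather than an error.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}

object MissingPerColumn {
  // The question's rule for a "missing" value.
  def isMissing(c: Column): Column =
    c.isNull || c === "" || c.isNaN || c === "-" || c === "NA"

  // One aggregation over all columns; returns (columnName, missingCount) pairs.
  def missingCounts(df: DataFrame): Seq[(String, Long)] = {
    val aggregates = df.columns.map(c => count(when(isMissing(col(c)), 1)).as(c))
    val row = df.select(aggregates: _*).first()
    df.columns.zipWithIndex.map { case (name, i) => name -> row.getLong(i) }.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session with ANSI mode disabled.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("missing-per-column")
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    val df = List[(Integer, Integer)]((1, null), (null, 2), (null, 3)).toDF("name", "rank")
    // Print in the question's "column:count" format.
    missingCounts(df).foreach { case (name, n) => println(s"$name:$n") }
    spark.stop()
  }
}
```

Because the result is a single row, collecting it to the driver with `first()` is cheap even when the DataFrame itself is large.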

edited Aug 7, 2018 at 15:35

answered Aug 7, 2018 at 12:27


1 Comment

Gagan Sp Over a year ago

This piece of code helps you get the number of missing values in each column of the DataFrame.
