
I have two dataframes called left and right.

scala> left.printSchema
root
 |-- user_uid: double (nullable = true)
 |-- labelVal: double (nullable = true)
 |-- probability_score: double (nullable = true)

scala> right.printSchema
root
 |-- user_uid: double (nullable = false)
 |-- real_labelVal: double (nullable = false)

Then I join them, with a left outer join, to get the joined DataFrame. Anyone interested in the natjoin function can find it here:

https://gist.github.com/anonymous/f02bd79528ac75f57ae8

scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")

scala> joinedData.printSchema
root
 |-- user_uid: double (nullable = true)
 |-- labelVal: double (nullable = true)
 |-- probability_score: double (nullable = true)
 |-- real_labelVal: double (nullable = false)

Since it is a left outer join, the real_labelVal column contains nulls for rows whose user_uid is not present in right.

scala> val realLabelVal = joinedData.select("real_labelval").distinct.collect
realLabelVal: Array[org.apache.spark.sql.Row] = Array([0.0], [null])

I want to replace the null values in the realLabelVal column with 1.0.

Currently I do the following:

  1. I find the index of the real_labelVal column and use the org.apache.spark.sql.Row API to set the nulls to 1.0. (This gives me an RDD[Row].)
  2. Then I apply the schema of the joined dataframe to get the cleaned dataframe.

The code is as follows:

 val real_labelval_index = 3

 // Rebuild the row with 1.0 in place of the null real_labelVal
 def replaceNull(row: Row) = {
   val rowArray = row.toSeq.toArray
   rowArray(real_labelval_index) = 1.0
   Row.fromSeq(rowArray)
 }

 // In Spark 1.x, mapping over a DataFrame yields an RDD[Row]
 val cleanRowRDD = joinedData.map(row => if (row.isNullAt(real_labelval_index)) replaceNull(row) else row)
 val cleanJoined = sqlContext.createDataFrame(cleanRowRDD, joinedData.schema)
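Stripped of Spark, the per-row replacement above is just "swap in a default at one index when the value is null". A minimal pure-Scala sketch of that semantics (illustrative names and data, no Spark required):

```scala
// Pure-Scala illustration of the per-row logic: a "row" is a Seq of values,
// with null standing in for SQL NULL; replace the value at a given index
// with a default when it is null, otherwise leave the row untouched.
val realLabelValIndex = 3

def fillNullAt(row: Seq[Any], index: Int, default: Any): Seq[Any] =
  if (row(index) == null) row.updated(index, default) else row

val unmatched = Seq(42.0, 1.0, 0.9, null) // user_uid absent from right: real_labelVal is null
val matched   = Seq(43.0, 0.0, 0.1, 0.0)  // user_uid present in right

fillNullAt(unmatched, realLabelValIndex, 1.0) // Seq(42.0, 1.0, 0.9, 1.0)
fillNullAt(matched, realLabelValIndex, 1.0)   // unchanged: Seq(43.0, 0.0, 0.1, 0.0)
```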

Is there an elegant or efficient way to do this?

Googling hasn't helped much. Thanks in advance.

Comments:
  • What does nat stand for in natjoin? (Commented Aug 12, 2016 at 16:20)
  • @JosiahYoder nat stands for Natural Join. (Commented Sep 12, 2016 at 19:11)

1 Answer


Have you tried using na?

joinedData.na.fill(1.0, Seq("real_labelval"))

Comments:

  • Thanks for the quick response. The problem is we use the Cloudera distribution and the cluster has Spark 1.3.0. The fill functions were introduced in Spark 1.4, I think. I am accepting this as the answer.
  • Do I need to import anything to use na? Thanks
  • @GavinNiu No, na is a method directly on DataFrame.
  • What is Seq() doing?
  • fill takes an array (Seq), thus the wrapper.
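As the first comment notes, na.fill only appeared in Spark 1.4. On a 1.3 cluster, one possible workaround is to rewrite the column with a SQL CASE WHEN expression via selectExpr. This is only a sketch against the question's joinedData, and assumes the Spark SQL parser in use accepts CASE WHEN:

```scala
// Hypothetical Spark 1.3 workaround: no na.fill, so rebuild the column
// with a SQL CASE WHEN expression instead of mapping over an RDD[Row].
val cleanJoined = joinedData.selectExpr(
  "user_uid",
  "labelVal",
  "probability_score",
  "CASE WHEN real_labelVal IS NULL THEN 1.0 ELSE real_labelVal END AS real_labelVal")
```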
