Scala java.lang.String cannot be cast to java.lang.Double error when converting double type dataframe to LabeledPoint in Spark

Ask Question

Asked 9 years, 8 months ago

Modified 9 years, 8 months ago

Viewed 9k times

I have a dataset of 2002 variables. All variables are numeric. I first read in the dataset to Spark 1.5.0 and created a Double Type dataframe following the instruction here . Then I converted the dataframe to LabeledPoint following instructions here and here. However, when I tried to print out sample rows in the generated LabeledPoint, I got the "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double" error. Below is the Scala code I used. Sorry for the long code but I hope that will help the debug.

Could anyone please tell me where the error is coming from and how to resolve the problem? Thank you very much for your help!

Below is the Scala code I used:

// Read in dataset but drop the header row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainRDD = sc.textFile("train.txt").filter(line => !line.contains("target"))
// Read in header file to get column names. Store in an Array.
val dictFile = "header.txt"
var arrName = new Array[String](2002)
for (line <- Source.fromFile(dictFile).getLines) {
    arrName = line.split('\t').map(_.trim).toArray
}

// Create dataframe using programmatically specifying the schema method
// Encode schema in a string
var schemaString = arrName.mkString(" ")
// Import Row
import org.apache.spark.sql.Row
// Import RDD
import org.apache.spark.rdd.RDD
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,LongType,FloatType,DoubleType}
// Generate the Double Type schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
// Create rowRDD and convert String type to Double type
val arrVar = sc.broadcast(0 to 2001 toArray)
def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
    return rowRDD
}
val rowRDDTrain = createRowRDD(trainRDD, arrVar)
// Apply the schema to the RDD.
val trainDF = sqlContext.createDataFrame(rowRDDTrain, schema)
trainDF.printSchema
// Verified all 2002 variables are in "double (nullable = true)" format

// Define toLabeledPoint( ) to convert dataframe to LabeledPoint format
// Reference: https://stackoverflow.com/questions/31638770/rdd-to-labeledpoint-conversion
def toLabeledPoint(dataDF:org.apache.spark.sql.DataFrame) : org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = {
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    val targetInd = dataDF.columns.indexOf("target")
    val ignored = List("target")
    val featInd = dataDF.columns.diff(ignored).map(dataDF.columns.indexOf(_))
    val dataLP = dataDF.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
     Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
    return dataLP
}

// Create LabeledPoint from dataframe
val trainLP = toLabeledPoint(trainDF)
// Print out sammple rows in the generated LabeledPoint
trainLP.take(5).foreach(println)
// Failed: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double

Update:

Thanks a lot for David Griffin's and zero323's comments below. David is correct. I find that exception is indeed caused by the null values in the data. I replaced the following original code:

 def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    val rowRDD = rdd.map(_.split("\t")).map(_.map({y => y.toDouble})).map(p => Row.fromSeq(anArray.value map p))
    return rowRDD
}

with this one to impute null values to 0.0 and then the problem is gone:

def createRowRDD(rdd:RDD[String], anArray:org.apache.spark.broadcast.Broadcast[Array[Int]]) : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    val rowRDD = rdd.map(_.split("\t")).map(_.map({y => try {y.toDouble} catch {case _ : Throwable => 0.0}})).map(p => Row.fromSeq(anArray.value map p))
    return rowRDD
}

edited May 23, 2017 at 12:32

CommunityBot

11 silver badge

asked Mar 16, 2016 at 17:33

machine_learner

2252 silver badges10 bronze badges

2

Could you reduce this to a minimal reproducible example?

zero323
– zero323

2016-03-16 17:42:07 +00:00
Commented Mar 16, 2016 at 17:42
3

Do you have null values in your data? If you cast a null string to Long perhaps that causes a problem.

David Griffin
– David Griffin

2016-03-16 17:47:01 +00:00
Commented Mar 16, 2016 at 17:47
Thanks David and zero323! David is correct. The problem is caused by the null values in the data. I have updated my original post to add a solution to impute null values to 0.0. This problem is now resolved so quickly thanks to your great help!

machine_learner
– machine_learner

2016-03-16 20:56:04 +00:00
Commented Mar 16, 2016 at 20:56

Add a comment |

0 Your Answer

Sign up or log in

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Collectives™ on Stack Overflow

Scala java.lang.String cannot be cast to java.lang.Double error when converting double type dataframe to LabeledPoint in Spark

0

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Linked