
I have a dataset of 2002 variables, all numeric. I first read the dataset into Spark 1.5.0 and created a DoubleType dataframe following the instructions here. Then I converted the dataframe to LabeledPoint following the instructions here and here. However, when I tried to print out sample rows of the generated LabeledPoint, I got the error "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double". Sorry for the long code below, but I hope it helps with debugging.

Could anyone please tell me where the error is coming from and how to resolve the problem? Thank you very much for your help!

Below is the Scala code I used:

// Read in dataset but drop the header row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainRDD = sc.textFile("train.txt").filter(line => !line.contains("target"))
// Read in header file to get column names. Store in an Array.
import scala.io.Source
val dictFile = "header.txt"
var arrName = new Array[String](2002)
for (line <- Source.fromFile(dictFile).getLines) {
    arrName = line.split('\t').map(_.trim).toArray
}
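// (Assumption: header.txt is a single tab-delimited line holding the 2002
// column names, one of them "target"; if the file had several lines, only
// the last would be kept, since arrName is reassigned on each iteration.)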

// Create dataframe by programmatically specifying the schema
// Encode the schema in a string (assumes column names contain no spaces)
val schemaString = arrName.mkString(" ")
// Import Row
import org.apache.spark.sql.Row
// Import RDD
import org.apache.spark.rdd.RDD
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,LongType,FloatType,DoubleType}
// Generate the Double Type schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
// Create rowRDD: split each line on tabs, parse every field to Double, and
// wrap the reordered fields in a Row
val arrVar = sc.broadcast((0 to 2001).toArray)
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    val rowRDD = rdd.map(_.split("\t")).map(_.map(y => y.toDouble)).map(p => Row.fromSeq(anArray.value map p))
    rowRDD
}
val rowRDDTrain = createRowRDD(trainRDD, arrVar)
// Apply the schema to the RDD.
val trainDF = sqlContext.createDataFrame(rowRDDTrain, schema)
trainDF.printSchema
// Verified all 2002 variables are in "double (nullable = true)" format
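// (Side note: for an RDD[Row], createDataFrame is lazy and does not validate
// rows against the schema, so an action such as trainDF.take(1) here would
// surface conversion problems earlier than printSchema, which only prints
// the declared schema.)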

// Define toLabeledPoint( ) to convert dataframe to LabeledPoint format
// Reference: https://stackoverflow.com/questions/31638770/rdd-to-labeledpoint-conversion
def toLabeledPoint(dataDF: org.apache.spark.sql.DataFrame): org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = {
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    // Use the "target" column as the label and all remaining columns as features
    val targetInd = dataDF.columns.indexOf("target")
    val ignored = List("target")
    val featInd = dataDF.columns.diff(ignored).map(dataDF.columns.indexOf(_))
    val dataLP = dataDF.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
        Vectors.dense(featInd.map(r.getDouble(_)))))
    dataLP
}

// Create LabeledPoint from dataframe
val trainLP = toLabeledPoint(trainDF)
// Print out sample rows of the generated LabeledPoint
trainLP.take(5).foreach(println)
// Failed: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double

Update:

Thanks a lot to David Griffin and zero323 for their comments below. David is correct: the exception is indeed caused by null values in the data. I replaced the following original code:

def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    val rowRDD = rdd.map(_.split("\t")).map(_.map(y => y.toDouble)).map(p => Row.fromSeq(anArray.value map p))
    rowRDD
}

with this one, which imputes unparseable (null) values to 0.0, and the problem is gone:

def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    // Fields that fail to parse as Double (e.g. nulls / empty strings) are imputed to 0.0
    val rowRDD = rdd.map(_.split("\t")).map(_.map(y => try { y.toDouble } catch { case _: Throwable => 0.0 })).map(p => Row.fromSeq(anArray.value map p))
    rowRDD
}
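
As a side note, the same imputation can be written more compactly with scala.util.Try. This is only a sketch of an equivalent variant (the name createRowRDDSafe is mine); one small difference is that Try catches only non-fatal exceptions, which is generally safer than catching Throwable:

import scala.util.Try

def createRowRDDSafe(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    // Try(...).getOrElse(0.0) imputes any unparseable field to 0.0 without
    // swallowing fatal errors such as OutOfMemoryError
    val rowRDD = rdd.map(_.split("\t")).map(_.map(y => Try(y.toDouble).getOrElse(0.0))).map(p => Row.fromSeq(anArray.value map p))
    rowRDD
}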
  • Could you reduce this to a minimal reproducible example? Commented Mar 16, 2016 at 17:42
  • Do you have null values in your data? If you cast a null string to Long perhaps that causes a problem. Commented Mar 16, 2016 at 17:47
  • Thanks David and zero323! David is correct. The problem is caused by the null values in the data. I have updated my original post to add a solution to impute null values to 0.0. This problem is now resolved so quickly thanks to your great help! Commented Mar 16, 2016 at 20:56
