
I am trying to load a CSV file into a temp table so that I can query it, and I am having two issues. First, I tried loading the CSV into a DataFrame; the CSV has some empty fields, and I didn't find a way to handle them. I found a suggestion in another post to use:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")

but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv"

Then I loaded the file and read it as a text file, without the headers:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class cars(id: Int, name: String, licence: String)
// fails on rows with an empty licence: split(",") drops trailing empty fields,
// so p(2) throws ArrayIndexOutOfBoundsException
val carsDF = sc.textFile("../myTests/cars.csv").map(_.split(",")).map(p => cars(p(0).trim.toInt, p(1).trim, p(2).trim)).toDF()
carsDF.registerTempTable("cars")
val dgp = sqlContext.sql("SELECT * FROM cars")
dgp.show()

This gives an error because one of the licence fields is empty. I tried to handle this when building the DataFrame, but it did not work. I could obviously go into the CSV file and fix it by adding a null value, but I do not want to do that because there are a lot of fields and it would be problematic. I want to fix it programmatically, either when I create the DataFrame or in the class definition.

If you have any other thoughts, please let me know as well.

2 Answers


To be able to use spark-csv you have to make sure it is available. In interactive mode the simplest solution is to use the --packages argument when you start the shell:

bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
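
If you are building a standalone application rather than using the shell, the equivalent is to declare the package as a build dependency. A minimal sbt sketch; the versions shown are assumptions, match them to your cluster:

// build.sbt (sketch; adjust Spark, Scala, and spark-csv versions to your setup)
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "1.4.0" % "provided",
  "com.databricks"   %% "spark-csv" % "1.1.0"
)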

Regarding manual parsing: working with CSV files, especially malformed ones like cars.csv, requires much more work than simply splitting on commas. Some things to consider:

  • how to detect the CSV dialect, including the method of string quoting
  • how to handle quotes and newline characters inside strings
  • how to handle malformed lines

In the case of the example file, you have to at least:

  • filter empty lines
  • read header
  • map lines to fields, providing a default value if a field is missing (a minimal sketch follows this list)
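
A minimal sketch of those steps, reusing the cars case class and the sqlContext.implicits._ import from the question. It assumes the file has a header row and that id is always present; the "unknown" default for a missing licence is purely for illustration:

// filter empty lines, strip the header, and default missing fields
val lines = sc.textFile("../myTests/cars.csv").filter(_.trim.nonEmpty)
val header = lines.first()
val carsDF = lines
  .filter(_ != header)
  .map(_.split(",", -1).map(_.trim)) // limit -1 keeps trailing empty fields
  .map(_.padTo(3, ""))               // pad short rows so indexing is safe
  .map(p => cars(p(0).toInt, p(1), if (p(2).isEmpty) "unknown" else p(2)))
  .toDF()
carsDF.registerTempTable("cars")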

Here you go. Remember to check the delimiter for your CSV.

// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
        .master("local")
        .appName("Spark CSV Reader")
        .getOrCreate()

// read csv
val df = spark.read
         .format("csv")
         .option("header", "true") //reading the headers
         .option("mode", "DROPMALFORMED")
         .option("delimiter", ",")
         .load("/your/csv/dir/simplecsv.csv")

// create a table from dataframe
df.createOrReplaceTempView("tableName")
// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")
// display sql results; display() only exists in Databricks notebooks, use show() elsewhere
sqlResults.show()
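
Note that DROPMALFORMED silently discards bad rows. If you would rather keep rows with empty fields, which the CSV reader turns into nulls by default, leave the mode at its PERMISSIVE default and filter in the query instead. A small sketch, using the licence column from the question as an assumed example:

// keep every row; empty fields arrive as null, so filter them in SQL
val df2 = spark.read
         .format("csv")
         .option("header", "true")
         .load("/your/csv/dir/simplecsv.csv") // PERMISSIVE mode is the default

df2.createOrReplaceTempView("cars")
spark.sql("SELECT * FROM cars WHERE licence IS NOT NULL").show()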
