Can anyone tell me how to write queries using spark-shell for .csv file?
What I have achieved was to read a .csv file using databricks library and create a dataframe as shown below:
./spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("mylocalpath.csv")
Then I can run df.printSchema() and other DataFrame operations without any problem. But I was wondering how I can write actual queries against it?
I saw the instructions at http://spark.apache.org/docs/latest/sql-programming-guide.html, which mention Programmatically Specifying the Schema. I followed that procedure, only reading a .csv file instead of a text file, but when I did val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim)), I got an error saying value split is not a member of org.apache.spark.sql.Row. How can I fix this problem?
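For what it's worth, that error usually means `people` is already a DataFrame (whose elements are `Row`s), while the programmatic-schema recipe expects an RDD of raw text lines, where each element is a `String` and therefore has `split`. A minimal sketch of that recipe, assuming it runs in spark-shell (so `sc` is predefined) and using placeholder column names `id` and `price`:

```scala
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val sqlContext = new SQLContext(sc)

// Start from raw text lines, NOT from a DataFrame of Rows
val lines = sc.textFile("mylocalpath.csv")
val header = lines.first()
val data = lines.filter(_ != header)   // drop the header line

// Build the schema programmatically
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("price", StringType, nullable = true)))

// split compiles here because each element is a String, not a Row
val rowRDD = data.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDF = sqlContext.createDataFrame(rowRDD, schema)
```

That said, if spark-csv already gives you a DataFrame with a header-derived schema, you don't need this step at all; you can query the DataFrame directly.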
And if there is an easier way to write SQL queries, please let me know. What I ultimately want is something simple: select two columns, one for id and one for price, and return the highest price.
df.printSchema() looks like this:
|-- TAXROLL_NUMBER: string (nullable = true)
|-- BUILDING_NAME: string (nullable = true)
|-- ASSESSED_VALUE: string (nullable = true)
|-- STREET_NAME: string (nullable = true)
|-- POSTAL_CODE: string (nullable = true)
|-- CITY: string (nullable = true)
|-- BUILD_YEAR: string (nullable = true)
|-- Lon: string (nullable = true)
|-- Lat: string (nullable = true)
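Given that schema, the simplest route is to register the DataFrame as a temporary table and query it with SQL, or use the DataFrame API directly. Note that ASSESSED_VALUE is a string, so it needs a cast before ordering numerically. A sketch, assuming `df` and `sqlContext` from the code above and treating TAXROLL_NUMBER as the id and ASSESSED_VALUE as the price:

```scala
// SQL route: register the DataFrame as a temp table and query it
df.registerTempTable("properties")
val top = sqlContext.sql(
  """SELECT TAXROLL_NUMBER, ASSESSED_VALUE
     FROM properties
     ORDER BY CAST(ASSESSED_VALUE AS DOUBLE) DESC
     LIMIT 1""")
top.show()

// Equivalent DataFrame API route
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.select($"TAXROLL_NUMBER", $"ASSESSED_VALUE".cast("double").as("price"))
  .orderBy(desc("price"))
  .limit(1)
  .show()
```

Either form returns the row with the highest assessed value; pick whichever style you find more readable.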
What does printSchema() show? Once you have a valid DataFrame with a valid schema, you are good to go in terms of querying. If you post the schema, I'll show you how.