I am new to Apache Spark (and Scala) and I want to apply a simple SQL query right after reading a CSV file into a DataFrame, without creating an additional DataFrame, temporary view, or table.
This is the initial query:
SELECT DISTINCT city from cities
WHERE id IN ("10", "20")
AND year IN ("2017", "2018")
This is what I tried in Scala:
val cities = spark.read.options(Map("header" -> "true", "delimiter" -> ";")).csv("test.csv").select("city").distinct.where("""id IN ("10", "20") AND year IN ("2017", "2018")""")
cities.show(20)
But it doesn't work. Concretely, it seems the problem is that Spark no longer recognizes the other two columns (since I selected only one column beforehand, the id and year columns are gone by the time the filter runs). So I had to select all three columns first, register a temporary view, and then select the wanted column into a new DataFrame. I find this approach too long and too heavy.
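For reference, I suspect the fix is simply to apply the filter before the projection, so that id and year still exist when the predicate is evaluated. A minimal sketch of what I mean (assuming test.csv has the columns id, year, and city):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Filter on id and year while those columns are still present,
// then project down to city and deduplicate.
val cities = spark.read
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("test.csv")
  .where("""id IN ("10", "20") AND year IN ("2017", "2018")""")
  .select("city")
  .distinct()

cities.show(20)
```

Is this the idiomatic way to do it without a temporary view?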
Can you help me fix this, please? Thank you!
Also, where and filter are the same, or am I wrong?