
I am new to Apache Spark (and Scala), and I want to apply a simple SQL query right after reading a CSV file into a DataFrame, without creating an additional DataFrame, temporary view, or table.

This is the initial query:

SELECT DISTINCT city from cities
WHERE id IN ("10", "20")
AND year IN ("2017", "2018")

This is what I tried in Scala:

val cities = spark.read.options(Map("header" -> "true", "delimiter" -> ";")).csv("test.csv").select("city").distinct.where(""" id IN ("10", "20") AND year IN ("2017", "2018")""")

cities.show(20)

But it doesn't work. Concretely, the problem seems to be that Spark no longer recognizes the other two columns of the DataFrame, since I selected only one column beforehand. So I had to select all three columns first, register a temporary view, and then select the wanted column into a new DataFrame. I find this approach too long and too heavy; see the sketch below.
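For reference, this is roughly the heavyweight workaround I describe above (a sketch; the table and column names are the ones from my query):

// Load all three columns, register a temporary view, then run the SQL.
val df = spark.read
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("test.csv")
  .select("id", "year", "city")

df.createOrReplaceTempView("cities")

val cities = spark.sql("""
  SELECT DISTINCT city FROM cities
  WHERE id IN ("10", "20") AND year IN ("2017", "2018")
""")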

Can you help me fix this, please? Thank you!

  • Change the where to filter and move it before the select. Commented Dec 25, 2018 at 15:29
  • Yeah, it works! Thank you very much. Commented Dec 25, 2018 at 16:15
  • No problem, amigo. Commented Dec 25, 2018 at 16:31
  • @sramalingam24 AFAIK where and filter are the same, or am I wrong? Commented Dec 25, 2018 at 20:21
  • Yup, one and the same. Just eye candy for SQL folks. Commented Dec 25, 2018 at 20:47

3 Answers


Your solution is almost correct; you just need to move the where before the select(...).distinct:

import spark.implicits._  // needed for the $"colName" column syntax

val cities = spark.read
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("test.csv")
  .where($"id".isin("10", "20") and $"year".isin("2017", "2018"))
  .select("city").distinct


The Spark Scala API is more imperative than declarative (unlike SQL), which is why after you select("city") you lose all the other fields of the DataFrame, and why, as others noted, you should filter/where before you select. This is a bit confusing because the Scala DSL is similar in syntax to SQL.
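A minimal sketch that makes the point visible (assuming the same test.csv as in the question): once you project down to city, the other columns are gone from the schema, so a later reference to id can no longer be resolved.

val df = spark.read
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("test.csv")

// Only `city` survives the projection:
df.select("city").printSchema()
// root
//  |-- city: string (nullable = true)

// This fails with an AnalysisException (cannot resolve 'id'):
// df.select("city").where("id IN ('10', '20')")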



As mentioned by sramalingam24 and Raphael Roth, the where has to be applied before selecting the required field from the DataFrame. filter and where both give the same result, as shown below. dropDuplicates() removes the duplicates in the city column.

    import spark.implicits._  // needed for the $"colName" column syntax

    val cities = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
       .csv("test.csv")
       .filter($"id".isin("10", "20") and $"year".isin("2017", "2018"))
       .select("city")
       .dropDuplicates()
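Note that distinct() is simply an alias for dropDuplicates() in Spark, so either call works here. To inspect the result:

    cities.show(20)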

