
I am new to Apache Spark (and Scala), and I want to apply a simple SQL query right after reading a CSV file into a DataFrame, without creating an additional DataFrame, temporary view, or table.

This is the initial query:

SELECT DISTINCT city from cities
WHERE id IN ("10", "20")
AND year IN ("2017", "2018")

This is what I tried in Scala:

val cities = spark.read.options(Map("header" -> "true", "delimiter" -> ";")).csv("test.csv").select("city").distinct.where(""" id IN ("10", "20") AND year IN ("2017", "2018")""")

cities.show(20)

But it doesn't work. Concretely, the problem seems to be that Spark no longer recognizes the other two columns of the DataFrame, since I selected only one column beforehand. So I had to select all three columns first, register a temporary view, and then select the wanted column into a new DataFrame. I find this approach too long and too heavy; see the sketch below.
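For reference, this is roughly the heavyweight workaround I describe above (a sketch; the table and column names are the ones from my query):

// Load all three columns, register a temporary view, then run the SQL.
val df = spark.read
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("test.csv")
  .select("id", "year", "city")

df.createOrReplaceTempView("cities")

val cities = spark.sql("""
  SELECT DISTINCT city FROM cities
  WHERE id IN ("10", "20") AND year IN ("2017", "2018")
""")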

Can you help me fix this, please? Thank you!

  • Change the where to filter and move it before the select. Commented Dec 25, 2018 at 15:29
  • Yeah, it works! Thank you very much. Commented Dec 25, 2018 at 16:15
  • No problem, amigo. Commented Dec 25, 2018 at 16:31
  • @sramalingam24 AFAIK where and filter are the same, or am I wrong? Commented Dec 25, 2018 at 20:21
  • Yup, one and the same. Just eye candy for SQL folks. Commented Dec 25, 2018 at 20:47

3 Answers


Your solution is almost correct; you just need to move the where before the select(...).distinct:

import spark.implicits._  // needed for the $"colName" column syntax

val cities = spark.read
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("test.csv")
  .where($"id".isin("10", "20") and $"year".isin("2017", "2018"))
  .select("city").distinct


The Spark Scala API is more imperative than declarative (unlike SQL), which is why after you select("city") you lose all the other fields of the DataFrame, and why, as others noted, you should filter/where before you select. This is a bit confusing because the Scala DSL is similar in syntax to SQL.
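A minimal sketch that makes the point visible (assuming the same test.csv as in the question): once you project down to city, the other columns are gone from the schema, so a later reference to id can no longer be resolved.

val df = spark.read
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("test.csv")

// Only `city` survives the projection:
df.select("city").printSchema()
// root
//  |-- city: string (nullable = true)

// This fails with an AnalysisException (cannot resolve 'id'):
// df.select("city").where("id IN ('10', '20')")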



As mentioned by sramalingam24 and Raphael Roth, the where has to be applied before selecting the required field from the DataFrame. filter and where both give the same result, as shown below. dropDuplicates() removes the duplicates in the city column.

    import spark.implicits._  // needed for the $"colName" column syntax

    val cities = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
       .csv("test.csv")
       .filter($"id".isin("10", "20") and $"year".isin("2017", "2018"))
       .select("city")
       .dropDuplicates()
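Note that distinct() is simply an alias for dropDuplicates() in Spark, so either call works here. To inspect the result:

    cities.show(20)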

