
I have found a similar post here, but some extra issues appear when I apply it to a String variable. Let me explain what I am trying to do. I have a single-column DataFrame df1 which contains some place information:

+-------+
|place  |
+-------+
|Place A|
|Place B|
|Place C|
+-------+ 

And another DataFrame df2 as following:

+---+-------+
|id |place  |
+---+-------+
|1  |Place A|
|2  |Place C|
|3  |Place C|
|4  |Place B|
+---+-------+

I need to loop over df2 to check which place each id matches, and do something with the matched ids. The code snippet is as follows:

  val places = df1.distinct.map(_.toString).collect
  for (place <- places) {
    val students = df2.where(s"place = '$place'").select("id", "place")
    // do something with students (add some unique columns depending on the place)
    students.show(2)
  }

The error I got is a SQL ParseException:

extraneous input '[' expecting {'(', ....}
== SQL ==
academic_college = [Place A]
-------------------^^^

My understanding now is that this ParseException comes from the places array after the collect operation: each element inherently contains "[]":

places = Array([Place A], [Place B], [Place C])
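
For context, a minimal sketch of where the brackets come from (this assumes df1 has a single string column named place):

// Row.toString formats each row as "[value1,value2,...]", so mapping the rows
// to strings yields bracketed values:
df1.distinct.map(_.toString).collect  // Array("[Place A]", "[Place B]", "[Place C]")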

My questions are twofold:

  1. I only know how to collect df1 into an array and loop over it to achieve what I want, since the operations for each place are different. If we stay with this approach, what is the best way to remove the "[]", change it to "()", or do something else to resolve the ParseException?

  2. Is there a better way to achieve this without collecting (materializing) df1, keeping everything in DataFrames?

  • You missed the quotes. It should be where(s"place = '$place'"). Commented Apr 6, 2018 at 11:40
  • Thanks for pointing this out. I updated the post. Commented Apr 9, 2018 at 1:00

2 Answers


You can get an Array[String] from df1 as

val places = df1.distinct().collect().map(_.getString(0))

Now you can filter df2 with each place in the array as

places.foreach(place => {
  val student = df2.where($"place" === place).select("id","place")
  student.show()
})
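
For comparison, the SQL-string form from the question also works once the value is a quoted SQL string literal; the parse errors above come from unquoted (or bracketed) values. A minimal sketch, assuming none of the place names contain a single quote (which would need escaping):

places.foreach(place => {
  // Quote the literal so the parser sees place = 'Place A' rather than
  // place = Place A (which fails at the second word).
  val student = df2.where(s"place = '$place'").select("id", "place")
  student.show()
})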

But make sure this won't affect your original dataframe.

If df1 is small and fits in memory, you can collect it to the driver; otherwise it is not advisable.

If you provide some input and expected output, you could get more help easily.


8 Comments

Thanks Shankar, this almost solves the problem. I got a different ParseException with this: extraneous input 'Designated' expecting <EOF> (line 1, pos 30) == SQL == place = No Place Designated ------------------------^^^ Any idea why? ("No Place Designated" is another option in df1 which I did not list before, and it happens to be the first one after collect.)
This is what I tried that gave the above ParseException: for (place <- places) { val students = df2.where(s"place = $place").select("id", "place"); students.show(2) }
You can use foreach as I did above and use === for comparing.
Using "===" for comparison does seem to work. I do not know enough to know the exact reason. Is it because SQL expression always looking for a <EOF>? Do you mind explain a bit more why one works and the other does not? Thanks.
=== is an equality check between a Column and a value (equivalent to the equalTo function), so the value is passed as a literal and never goes through the SQL parser. The SQL-string version fails here because the string literal is not quoted: place = No Place Designated does not parse, whereas place = 'No Place Designated' would.

I need to loop over df2 to check which place each id matches, and do something with the matched ids.

collect() and iterating over the collected data is expensive, as all the processing happens on the driver node.

I would suggest you use a join instead.

Let's say you have

df1
+-------+
|place  |
+-------+
|Place A|
|Place B|
+-------+

and

df2
+---+-------+
|id |place  |
+---+-------+
|1  |Place A|
|2  |Place C|
|3  |Place C|
|4  |Place B|
+---+-------+

You can get the matching place for each id using a join as

df2.join(df1, Seq("place"))

which should give you

+-------+---+
|place  |id |
+-------+---+
|Place A|1  |
|Place B|4  |
+-------+---+

And now you can perform your "do something on the matched ids" on this dataframe.
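
For example, if the follow-up work needs the ids grouped per place (as discussed in the comments below), one possible sketch is an aggregation after the join. The collect_list here is only an illustration of gathering the matched ids, not necessarily what your processing requires:

import org.apache.spark.sql.functions.collect_list

val matched = df2.join(df1, Seq("place"))

// One row per place with the list of matching ids; each list can then be
// fed into whatever per-place processing you need.
val idsPerPlace = matched.groupBy("place").agg(collect_list("id").as("ids"))
idsPerPlace.show(false)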

I hope the answer is helpful

4 Comments

Thanks Ramesh, I agree that collect and for is not the best way. Originally, I thought groupBy("place") could be a solution, but I have a fairly complicated operation after the groupBy. For example, I will use the "id" in each group as the sourceIds for ParallelPersonalizePageRank and add a column with the PPR rank to each group. I was not familiar with the RelationalGroupedDataset returned by groupBy. Do you think this can be done with groupBy followed by the operations I described?
Yes, you can definitely do that. You can perform an aggregation on the ids and pass them to PageRank. But for that I would need to know the details of the ParallelPersonalizePageRank, and that would be another question, I guess.
Sorry Ramesh, I had to accept Shankar's answer for this question since it is more directly related to the collect-and-for approach I am trying this time. I will try to do it with groupBy later; I am sure I will face questions there, and I will post a more direct question about that method later. Thank you for confirming that groupBy is also a viable direction to try.
@GuanghuaShu that's alright. I don't want my answer to be accepted if there's a better and more helpful answer that meets your needs. :)
