
I have found a similar post here, but some extra issues appear when I apply it to a String variable. Let me explain what I am trying to do. I have a single-column DataFrame df1 which contains some place information:

+-------+
|place  |
+-------+
|Place A|
|Place B|
|Place C|
+-------+ 

And another DataFrame df2 as following:

+---+-------+
|id |place  |
+---+-------+
|1  |Place A|
|2  |Place C|
|3  |Place C|
|4  |Place B|
+---+-------+

I need to loop over df2 to check which place each id matches, and do something with the matched ids. The code snippet is as follows:

  val places = df1.distinct.map(_.toString).collect
  for (place <- places) {
    val students = df2.where(s"place = '$place'").select("id", "place")
    // do something with students (add some unique columns depending on the place)
    students.show(2)
  }

The error I got is a SQL ParseException:

extraneous input '[' expecting {'(', ....}
== SQL ==
academic_college = [Place A]
-------------------^^^

My understanding now is that this ParseException comes from the places array after the collect operation: each element inherently contains "[]":

places = Array([Place A], [Place B], [Place C])
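
For context, a minimal sketch of where the brackets come from (this assumes df1 has a single string column named place):

// Row.toString formats each row as "[value1,value2,...]", so mapping the rows
// to strings yields bracketed values:
df1.distinct.map(_.toString).collect  // Array("[Place A]", "[Place B]", "[Place C]")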

My questions are twofold:

  1. I only know how to collect df1 into an array and loop over it to achieve what I want, since the operations for each place are different. If we stay with this approach, what is the best way to remove the "[]", change it to "()", or do something else to resolve the ParseException?

  2. Is there a better way to achieve this without collecting (materializing) df1, keeping everything in DataFrames?

  • You missed the quotes. It should be where(s"place = '$place'"). Commented Apr 6, 2018 at 11:40
  • Thanks for pointing this out. I updated the post. Commented Apr 9, 2018 at 1:00

2 Answers


You can get an Array[String] from df1 as

val places = df1.distinct().collect().map(_.getString(0))

Now you can filter df2 with each place in the array as

places.foreach(place => {
  val student = df2.where($"place" === place).select("id","place")
  student.show()
})
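
For comparison, the SQL-string form from the question also works once the value is a quoted SQL string literal; the parse errors above come from unquoted (or bracketed) values. A minimal sketch, assuming none of the place names contain a single quote (which would need escaping):

places.foreach(place => {
  // Quote the literal so the parser sees place = 'Place A' rather than
  // place = Place A (which fails at the second word).
  val student = df2.where(s"place = '$place'").select("id", "place")
  student.show()
})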

But make sure this won't affect your original dataframe.

If df1 is small and fits in memory, you can collect it to the driver; otherwise it is not advisable.

If you provide some input and expected output, you could get more help easily.


8 Comments

Thanks Shankar, this almost solves the problem. I got a different ParseException with this: extraneous input 'Designated' expecting <EOF> (line 1, pos 30) == SQL == place = No Place Designated ------------------------^^^ Any idea why? ("No Place Designated" is another option in df1 which I did not list before, and it happens to be the first one after collect.)
This is what I tried that gave the above ParseException: for (place <- places) { val students = df2.where(s"place = $place").select("id", "place"); students.show(2) }
You can use foreach as I did above and use === for comparing.
Using "===" for comparison does seem to work. I do not know enough to know the exact reason. Is it because SQL expression always looking for a <EOF>? Do you mind explain a bit more why one works and the other does not? Thanks.
=== is an equality check between a Column and a value (equivalent to the equalTo function), so the value is passed as a literal and never goes through the SQL parser. The SQL-string version fails here because the string literal is not quoted: place = No Place Designated does not parse, whereas place = 'No Place Designated' would.

I need to loop over df2 to check which place each id matches, and do something with the matched ids.

collect() and iterating over the collected data is expensive, as all the processing happens on the driver node.

I would suggest you use a join instead.

Let's say you have

df1
+-------+
|place  |
+-------+
|Place A|
|Place B|
+-------+

and

df2
+---+-------+
|id |place  |
+---+-------+
|1  |Place A|
|2  |Place C|
|3  |Place C|
|4  |Place B|
+---+-------+

You can get the matching place for each id using a join as

df2.join(df1, Seq("place"))

which should give you

+-------+---+
|place  |id |
+-------+---+
|Place A|1  |
|Place B|4  |
+-------+---+

And now you can perform your "do something on the matched ids" on this dataframe.
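
For example, if the follow-up work needs the ids grouped per place (as discussed in the comments below), one possible sketch is an aggregation after the join. The collect_list here is only an illustration of gathering the matched ids, not necessarily what your processing requires:

import org.apache.spark.sql.functions.collect_list

val matched = df2.join(df1, Seq("place"))

// One row per place with the list of matching ids; each list can then be
// fed into whatever per-place processing you need.
val idsPerPlace = matched.groupBy("place").agg(collect_list("id").as("ids"))
idsPerPlace.show(false)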

I hope the answer is helpful

4 Comments

Thanks Ramesh, I agree that collect and for is not the best way. Originally, I thought groupBy("place") could be a solution, but I have a fairly complicated operation after the groupBy. For example, I will use the "id" in each group as the sourceIds for ParallelPersonalizePageRank and add a column with the PPR rank to each group. I was not familiar with the RelationalGroupedDataset returned by groupBy. Do you think this can be done with groupBy followed by the operations I described?
Yes, you can definitely do that. You can perform an aggregation on the ids and pass them to PageRank. But for that I would need to know the details of the ParallelPersonalizePageRank, and that would be another question, I guess.
Sorry Ramesh, I had to accept Shankar's answer for this question since it is more directly related to the collect-and-for approach I am trying this time. I will try to do it with groupBy later; I am sure I will face questions there, and I will post a more direct question about that method later. Thank you for confirming that groupBy is also a viable direction to try.
@GuanghuaShu that's alright. I don't want my answer to be accepted if there's a better and more helpful answer that meets your needs. :)
