I have found similar post here, but some extra issue appear when I apply this to String variable. Let me explain what I am trying to do. I have a single column DataFrame df1 which contains some place information:
+-------+
|place |
+-------+
|Place A|
|Place B|
|Place C|
+-------+
And another DataFrame df2 as following:
+--+-------+
|id|place |
+--+-------+
|1| Place A|
|2| Place C|
|3| Place C|
|4| Place B|
I need to loop over df2 to check which place does each id match, and do something on the matched ids. The code snippet is as following:
val places = df1.distinct.map(_.toString).collect
for (place <- places){
val students = df2.where(s"place = '$place'").select("id","place")
// do something on students (add some unique columns depending the place)
students.show(2)
}
The error I got is a SQL ParseException:
extraneous input '[' expecting {'(', ....}
== SQL ==
academic_college = [Place A]
-------------------^^^
My understanding now is that this Parse Exception comes from the places Array after I do the collect operation. It inherently contains "[]":
places = Array([Place A], [Place B], [Place C])
My questions are two folds:
I only know how to collect df1 into Array and loop over it to achieve what I want since the operations to each place is different. If we stay with this approach, what is the best way to remove "[]" or changed it to "()" or do something else to resolve the Parse Exception?
Is there any better way to achieve this without collecting (materialize) df1 and keep everything in DataFrame?
where(s"place = '$place'")