I have a DataFrame df that has, among other columns, a groupID column; that is, each observation belongs to a specific group. There are 8 groups in total. I would like to sample a fixed percentage of observations (say, 20%) from each group. Here is my approach:
// Array.range excludes the upper bound, so (0, 8) covers group IDs 0 through 7
val sample_df = for ( i <- Array.range(0, 8) ) yield {
  val sel_df = df.filter($"groupID" === i)
  sel_df.sample(false, 0.2, seed1)
}
The result of this code is:
Array[org.apache.spark.sql.DataFrame] = Array([text: string, groupID: int], [text: string, groupID: int])
I then tried to flatten sample_df into a single DataFrame with flatMap(), but I got an error:
val flat_df = sample_df.flatMap(x => x)
<console>:59: error: type mismatch;
found: org.apache.spark.sql.DataFrame
required: scala.collection.GenTraversableOnce[?]
How can I combine these per-group samples into a single sampled DataFrame?
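For context on the error: flatMap on a Scala Array expects the function to return a Scala collection, and a DataFrame is not one, so the DataFrames have to be merged with a DataFrame operation instead. The sketch below shows two possible ways to do that. It assumes Spark 2.x or later (SparkSession and union; on Spark 1.x the equivalent of union is unionAll), group IDs 0 through 7, and made-up test data and seed. The object and method names are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object StratifiedSample {
  // Option 1: sample each group separately, then merge the pieces with union.
  // reduce throws on an empty groupIds, so at least one group ID is assumed.
  def sampleByLoop(df: DataFrame, groupIds: Seq[Int], fraction: Double, seed: Long): DataFrame =
    groupIds
      .map(id => df.filter(col("groupID") === id).sample(false, fraction, seed))
      .reduce(_ union _)

  // Option 2: a single pass using the built-in stratified sampler.
  // Groups missing from the fractions map are treated as fraction 0.
  def sampleByStat(df: DataFrame, groupIds: Seq[Int], fraction: Double, seed: Long): DataFrame =
    df.stat.sampleBy("groupID", groupIds.map(_ -> fraction).toMap, seed)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stratified-sample")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for df: 8 groups (IDs 0 to 7), 100 rows each.
    val df = (0 until 800).map(i => (s"text$i", i % 8)).toDF("text", "groupID")
    val groupIds = 0 until 8
    val seed = 42L // assumed seed value

    // Per-group row counts of each sample, for a quick sanity check.
    sampleByLoop(df, groupIds, 0.2, seed).groupBy("groupID").count().show()
    sampleByStat(df, groupIds, 0.2, seed).groupBy("groupID").count().show()

    spark.stop()
  }
}
```

Both sample and sampleBy select each row independently with the given probability, so each group ends up with roughly, not exactly, 20% of its rows. The sampleBy variant avoids running one filter per group, which may matter when there are many groups.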