For loop Spark dataframe

Question

I have a DataFrame df that has, among other columns, a groupID column; that is, each observation belongs to a specific group. In total there are 8 groups. I would like to sample a certain percentage of observations (say, 20%) from each groupID. Here is my approach:

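// Array.range excludes its end bound, so 8 groups need range(0, 8)
// (assuming the group IDs run from 0 to 7)
val sample_df = for (i <- Array.range(0, 8)) yield {
  val sel_df = df.filter($"groupID" === i)
  sel_df.sample(false, 0.2, seed1)
}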

The result of this code is:

Array[org.apache.spark.sql.DataFrame] = Array([text: string, groupID: int], [text: string, groupID: int])

I applied flatMap() on sample_df, but I got an error:

val flat_df = sample_df.flatMap(x => x)
         <console>:59: error: type mismatch;
         found: org.apache.spark.sql.DataFrame
         required: scala.collection.GenTraversableOnce[?]

How can I get a sampled dataframe?

Furkan Varol · Accepted Answer · 2016-07-21 11:01:09Z

2

As far as I understand, you are trying to get an RDD of Row. For that you can simply call:

val rows: RDD[Row] = sample_df.rdd

To explain the error: flatMap expects the function you pass to return something traversable, such as an Option or a collection, but here it returns a DataFrame, which is not a Scala collection.

Also, to get all data to the driver, you can call:

val rows: Array[Row] = sample_df.collect
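The flatMap error described above can be reproduced without Spark. The sketch below uses plain Scala collections and a hypothetical Table case class as a stand-in for a DataFrame (it is not a Spark type): flatMap works when each element is itself a collection, while elements that only offer a union-like method are combined with reduce instead.

```scala
// Elements that are collections: flatMap can flatten them.
val nested: Array[Seq[Int]] = Array(Seq(1, 2), Seq(3), Seq.empty[Int])
val flat: Array[Int] = nested.flatMap(xs => xs) // Array(1, 2, 3)

// Hypothetical stand-in for a DataFrame: not a collection, but it has union.
final case class Table(rows: Seq[String]) {
  def union(other: Table): Table = Table(rows ++ other.rows)
}

val tables: Array[Table] = Array(Table(Seq("a")), Table(Seq("b", "c")))

// tables.flatMap(t => t) would not compile, for the same reason as in the question.
// Folding the elements together pairwise with reduce does work.
val combined: Table = tables.reduce(_ union _)
```

The same shape applies to an Array of DataFrames: it has to be reduced with a union, not flattened.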


2 Comments

aigujin Over a year ago

Hi, thanks for the reply. Unfortunately, sample_df is an array of DataFrames (org.apache.spark.sql.DataFrame), and the .rdd method does not work on an array. What I need is to flatten this array into a single DataFrame. That is why I applied flatMap in the first place.

Furkan Varol Over a year ago

Right, sorry about that. Then Rockie Yang's answer is the correct one.

Rockie Yang · Accepted Answer · 2016-07-21 11:04:42Z

1

I guess you want to sample evenly from each group. If so, union the per-group samples into a single DataFrame:

sample_df.reduceLeft((result, df) => result.unionAll(df))
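As a fuller sketch of the reduce-and-union approach, the snippet below can be pasted into spark-shell or run as a script with Spark on the classpath. It assumes Spark 2.0 or later, where `union` is available (`unionAll` is the name in the 1.x API used above). The toy data, the seed value, and the names `samples` and `sampled` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Reuses the existing session in spark-shell, otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("union-samples")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: 800 rows spread evenly over group IDs 0 to 7.
val df: DataFrame = (0 until 800).map(i => (s"text$i", i % 8)).toDF("text", "groupID")

val seed = 42L // arbitrary seed, for repeatability only

// One sampled DataFrame per group, as in the question.
val samples: Seq[DataFrame] = (0 until 8).map { i =>
  df.filter($"groupID" === i)
    .sample(withReplacement = false, fraction = 0.2, seed = seed)
}

// Collapse the sequence into a single DataFrame by pairwise union.
val sampled: DataFrame = samples.reduce(_ union _)

// Per-group counts; sample() is probabilistic, so these are only close to 20%.
sampled.groupBy("groupID").count().orderBy("groupID").show()
```

Note that `reduce` throws on an empty sequence, so this assumes there is at least one group.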


Seth Hendrickson · Accepted Answer · 2016-07-21 23:47:02Z

0

It seems to me you just want to take a 20% sample of the entire dataframe? If so, then there is no reason to create 8 different dataframes and then union them back.

df.sample(false, 0.2, seed)

will do the trick. If you want to do different fractions for each groupID then check out df.stat.sampleBy. If you want to be sure that there is exactly 20% of each class in the sample then you'll have to convert to a PairRDD and use stratified sampling like:

// getInt keeps the key typed as Int so that it matches the Map[Int, Double] of fractions
df.rdd.map(row => (row.getInt(groupIDIndex), row)).sampleByKeyExact(false, Map(0 -> 0.2, 1 -> 0.2, ..., 8 -> 0.2), seed)
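The two stratified options mentioned above might look like the following when written out. This is a sketch under assumptions: Spark 2.0 or later, an integer groupID column with IDs 0 to 7, and made-up toy data, seed, and variable names. `df.stat.sampleBy` gives an approximate per-group fraction, while `sampleByKeyExact` on the pair RDD fixes the per-group sample size at the cost of extra passes over the data.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Reuses the existing session in spark-shell, otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("stratified-sample")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: 800 rows spread evenly over group IDs 0 to 7.
val df: DataFrame = (0 until 800).map(i => (s"text$i", i % 8)).toDF("text", "groupID")

val seed = 42L // arbitrary seed, for repeatability only

// The same 20% fraction for every group; use different values per key if needed.
val fractions: Map[Int, Double] = (0 until 8).map(id => id -> 0.2).toMap

// Option 1: approximate stratified sample, stays in the DataFrame API.
val approx: DataFrame = df.stat.sampleBy("groupID", fractions, seed)

// Option 2: exact per-group sample size via the pair-RDD API.
val groupIDIndex: Int = df.schema.fieldIndex("groupID")
val exactRows = df.rdd
  .map(row => (row.getInt(groupIDIndex), row)) // key each row by its group ID
  .sampleByKeyExact(withReplacement = false, fractions = fractions, seed = seed)
  .values                                      // drop the keys, keep the rows

// Rebuild a DataFrame with the original schema.
val exact: DataFrame = spark.createDataFrame(exactRows, df.schema)

exact.groupBy("groupID").count().orderBy("groupID").show()
```

Every key that occurs in the data needs an entry in `fractions` for the exact variant, so the map should be built from the actual set of group IDs.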
