17

Is there a way to get a single DataFrame by unioning DataFrames in a loop?

Here is some sample code:

var fruits = List(
  "apple"
  ,"orange"
  ,"melon"
) 

for (x <- fruits){         
  var df = Seq(("aaa","bbb",x)).toDF("aCol","bCol","name")
}

I would like to obtain something like this:

aCol | bCol | fruitsName
aaa  | bbb  | apple
aaa  | bbb  | orange
aaa  | bbb  | melon

Thanks again

2 Comments

  • What is this code? And what are you actually trying to do here?
  • This is not a union; this is a cartesian product.

6 Answers

28

You could create a sequence of DataFrames and then use reduce:

val results = fruits.
  map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
  reduce(_.union(_))

results.show()
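
If the fruits list might be empty, reduce would throw; a sketch of a variant using reduceOption (assuming spark.implicits._ is in scope for toDF, as in spark-shell) returns an Option instead:

// Sketch: same approach, but safe when `fruits` is empty.
val maybeResults = fruits
  .map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol", "bCol", "name"))
  .reduceOption(_ union _)           // None if `fruits` is empty

maybeResults.foreach(_.show())       // show only if at least one DataFrame was built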

2 Comments

Simple and beautiful!
Good to see the immutable approach.

21

Steffen Schmitz's answer is the most concise one, I believe. Below is a more detailed answer if you are looking for more customization (of field types, etc.):

import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import spark.implicits._ // needed for .toDF below (already in scope in spark-shell)

//initialize DF
val schema = StructType(
  StructField("aCol", StringType, true) ::
  StructField("bCol", StringType, true) ::
  StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

//list to iterate through
var fruits = List(
    "apple"
    ,"orange"
    ,"melon"
)

for (x <- fruits) {
  //union returns a new dataset
  initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}

//initialDF.show()
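
The same schema-driven approach can be written without the mutable variable by folding over the list; this is just a sketch, reusing the schema, spark, and sc defined above (and spark.implicits._ for toDF):

//fold instead of a var: start from the empty DataFrame and union one batch per fruit
val foldedDF = fruits.foldLeft(spark.createDataFrame(sc.emptyRDD[Row], schema)) {
  (acc, x) => acc.union(Seq(("aaa", "bbb", x)).toDF)
}
//foldedDF.show()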


Comments

15

If you have several different DataFrames, you can use the code below, which is efficient.

val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)

2 Comments

How can I keep adding new DataFrames to the Seq using a loop? I would like to do a union at the end, but the DataFrames in my Seq are to be added in a loop. Is it doable? (See the sketch after these comments.)
Why is this efficient? If you are applying a reduce function to a Scala Seq, you are not making use of cluster parallelism or distributed computing at all, right?
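
On the first comment: a minimal sketch of collecting DataFrames in a loop and unioning once at the end (the per-iteration DataFrame here is only illustrative; assumes spark.implicits._ is in scope). On the second comment: the reduce runs on the driver but only chains lazy union transformations, so the actual work is still executed by Spark in a distributed way.

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame

val dfs = ListBuffer.empty[DataFrame]
for (x <- fruits) {                        // any loop producing one DataFrame per iteration
  dfs += Seq(("aaa", "bbb", x)).toDF("aCol", "bCol", "name")
}
val combined = dfs.reduce(_ union _)       // lazy; evaluated when an action such as show() runs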
7

In a for loop:

val fruits = List("apple", "orange", "melon")

( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")

Comments

1

You can first create a sequence and then use toDF to create the DataFrame.

scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()

scala> for ( x <- fruits){
     |  dseq = dseq :+ ("aaa","bbb",x)
     | }

scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))

scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]

scala> df.show
+----+----+------+
|aCol|bCol|  name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+

5 Comments

And why did you feel the need to introduce a var here?
Actually, what I tried was to create a Seq and convert it to a DataFrame; since I'm iterating through the list of fruits and appending to the same variable, I made it a var.
The OP has used var but he did not actually need it. And you could have just mapped the fruits into your dseq. The important thing to note here is that your dseq is a List, and you are appending to this list in your for "loop". The problem with this is that append on a List is O(n), making your whole dseq generation O(n^2), which will kill performance on large data.
Just make it a general principle to avoid appending to a Scala List (a sketch of an alternative follows these comments).
Thanks @SarveshKumarSingh.
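
A sketch of one way to keep the loop shape while avoiding the O(n^2) List appends discussed above: accumulate into a Vector, whose :+ is effectively constant time (the map-based version in the next answer avoids the issue entirely; assumes spark.implicits._ is in scope).

// same loop, but appending to a Vector instead of a List
var dseq: Vector[(String, String, String)] = Vector.empty
for (x <- fruits) {
  dseq = dseq :+ ("aaa", "bbb", x)         // :+ on Vector is effectively constant time
}
val df = dseq.toDF("aCol", "bCol", "name")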
1

Well... I think your question is a bit misguided.

As per my limited understanding of what you are trying to do, you should be doing the following:

val fruits = List(
  "apple",
  "orange",
  "melon"
)

val df = fruits
  .map(x => ("aaa", "bbb", x))
  .toDF("aCol", "bCol", "name")

And this should be sufficient.

2 Comments

Thanks Sarvesh, but I need to get the unioned DataFrame in a loop, because there are various operations such as join and withColumn in the loop. I will get the DataFrame from Hive SQL in the loop.
"Union DataFrame in loop"... well, just this one statement leaves me unable to answer this question. Why do you need this "union DataFrame in loop"? Can you elaborate in your question with more details about the "various operations such as join, withColumn in the loop"?
