17

Is there a way to get a single DataFrame by unioning DataFrames in a loop?

Here is some sample code:

var fruits = List(
  "apple"
  ,"orange"
  ,"melon"
) 

for (x <- fruits){         
  var df = Seq(("aaa","bbb",x)).toDF("aCol","bCol","name")
}

I would like to obtain something like this:

aCol | bCol | fruitsName
aaa  | bbb  | apple
aaa  | bbb  | orange
aaa  | bbb  | melon

Thanks again

2 Comments

  • What is this code? And what are you actually trying to do here?
  • This is not a union; this is a cartesian product.

6 Answers

28

You could create a sequence of DataFrames and then use reduce:

val results = fruits.
  map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
  reduce(_.union(_))

results.show()
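
If the fruits list might be empty, reduce would throw; a sketch of a variant using reduceOption (assuming spark.implicits._ is in scope for toDF, as in spark-shell) returns an Option instead:

// Sketch: same approach, but safe when `fruits` is empty.
val maybeResults = fruits
  .map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol", "bCol", "name"))
  .reduceOption(_ union _)           // None if `fruits` is empty

maybeResults.foreach(_.show())       // show only if at least one DataFrame was built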

2 Comments

Simple and beautiful!
Good to see the immutable approach.

21

Steffen Schmitz's answer is the most concise one, I believe. Below is a more detailed answer if you are looking for more customization (of field types, etc.):

import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import spark.implicits._ // needed for .toDF below (already in scope in spark-shell)

//initialize DF
val schema = StructType(
  StructField("aCol", StringType, true) ::
  StructField("bCol", StringType, true) ::
  StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

//list to iterate through
var fruits = List(
    "apple"
    ,"orange"
    ,"melon"
)

for (x <- fruits) {
  //union returns a new dataset
  initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}

//initialDF.show()
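
The same schema-driven approach can be written without the mutable variable by folding over the list; this is just a sketch, reusing the schema, spark, and sc defined above (and spark.implicits._ for toDF):

//fold instead of a var: start from the empty DataFrame and union one batch per fruit
val foldedDF = fruits.foldLeft(spark.createDataFrame(sc.emptyRDD[Row], schema)) {
  (acc, x) => acc.union(Seq(("aaa", "bbb", x)).toDF)
}
//foldedDF.show()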


Comments

15

If you have several different DataFrames, you can use the code below, which is efficient.

val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)

2 Comments

How can I keep adding new DataFrames to the Seq using a loop? I would like to do a union at the end, but the DataFrames in my Seq are to be added in a loop. Is it doable? (See the sketch after these comments.)
Why is this efficient? If you are applying a reduce function to a Scala Seq, you are not making use of cluster parallelism or distributed computing at all, right?
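
On the first comment: a minimal sketch of collecting DataFrames in a loop and unioning once at the end (the per-iteration DataFrame here is only illustrative; assumes spark.implicits._ is in scope). On the second comment: the reduce runs on the driver but only chains lazy union transformations, so the actual work is still executed by Spark in a distributed way.

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame

val dfs = ListBuffer.empty[DataFrame]
for (x <- fruits) {                        // any loop producing one DataFrame per iteration
  dfs += Seq(("aaa", "bbb", x)).toDF("aCol", "bCol", "name")
}
val combined = dfs.reduce(_ union _)       // lazy; evaluated when an action such as show() runs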
7

In a for loop:

val fruits = List("apple", "orange", "melon")

( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")

Comments

1

You can first create a sequence and then use toDF to create the DataFrame.

scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()

scala> for ( x <- fruits){
     |  dseq = dseq :+ ("aaa","bbb",x)
     | }

scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))

scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]

scala> df.show
+----+----+------+
|aCol|bCol|  name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+

5 Comments

And why did you feel the need to introduce a var here?
Actually, what I tried was to create a Seq and convert it to a DataFrame; since I'm iterating through the list of fruits and appending to the same variable, I made it a var.
The OP has used var but he did not actually need it. And you could have just mapped the fruits into your dseq. The important thing to note here is that your dseq is a List, and you are appending to this list in your for "loop". The problem with this is that append on a List is O(n), making your whole dseq generation O(n^2), which will kill performance on large data.
Just make it a general principle to avoid appending to a Scala List (a sketch of an alternative follows these comments).
Thanks @SarveshKumarSingh.
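
A sketch of one way to keep the loop shape while avoiding the O(n^2) List appends discussed above: accumulate into a Vector, whose :+ is effectively constant time (the map-based version in the next answer avoids the issue entirely; assumes spark.implicits._ is in scope).

// same loop, but appending to a Vector instead of a List
var dseq: Vector[(String, String, String)] = Vector.empty
for (x <- fruits) {
  dseq = dseq :+ ("aaa", "bbb", x)         // :+ on Vector is effectively constant time
}
val df = dseq.toDF("aCol", "bCol", "name")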
1

Well... I think your question is a bit misguided.

As per my limited understanding of what you are trying to do, you should be doing the following:

val fruits = List(
  "apple",
  "orange",
  "melon"
)

val df = fruits
  .map(x => ("aaa", "bbb", x))
  .toDF("aCol", "bCol", "name")

And this should be sufficient.

2 Comments

Thanks Sarvesh, but I need to get the unioned DataFrame in a loop, because there are various operations such as join and withColumn in the loop. I will get the DataFrame from Hive SQL in the loop.
"Union DataFrame in loop"... well, just this one statement leaves me unable to answer this question. Why do you need this "union DataFrame in loop"? Can you elaborate in your question with more details about the "various operations such as join, withColumn in the loop"?
