1

I have a Map[String, Dataframe]. I want to combine all the data inside that Map into a single Dataframe. Can a dataframe have a column of Map datatype?

def sample(dfs : Map[String,Dataframe]): Dataframe =
{
.........
}

Example:

DF1

id name age
1  aaa  23
2  bbb  34

DF2

game  time  score
ludo  10    20
rummy 30    40 

I pass the above two DFs as Map to the function. Then put data of the each dataframes into a single column of the output dataframe as json format.

out DF

+---------------------------------------------------------------------------------------+
| column1                                                                              |
+---------------------------------------------------------------------------------------+
| [{"id":"1","name":"aaa","age":"23"},{"id":21","name":"bbb","age":"24"}]               |
| [{"game":"ludo","time":"10","score":"20"},{"game":"rummy","time":"30","score":"40"}]  |
+---------------------------------------------------------------------------------------+
4
  • could you provide example of what you want? \n Dataframes support struct and array types not maps Commented Nov 23, 2021 at 13:52
  • I want to put data of multiple dataframes to a single dataframe .If it does not support map then how can I get value of each input Dataframe as struct column to output dataframe Commented Nov 23, 2021 at 14:00
  • How should the dataframes be combined? Can you give an example? Commented Nov 23, 2021 at 14:12
  • I have edited the question with an example kindly check Commented Nov 23, 2021 at 15:13

2 Answers 2

1

You are asking to generate one row per dataframe. Be careful, if one of the dataframes is large enough so that it cannot be contained in one single executor, this code will break.

Let's first generate data and the map dfs of type Map[String, DataFrame].

val df1 = Seq((1, "aaa", 23), (2, "bbb", 34)).toDF("id", "name", "age")
val df2 = Seq(("ludo", 10, 20), ("rummy", 10, 40)).toDF("game", "time", "score")
dfs = Seq(df1, df2)

Then, for each dataframe of the map, we generate two columns. big_map associates each column name of the dataframe to its value (cast in string to have a consistent type). df simply contains the name of the dataframe. We then union all the dataframes with reduce and group by name (that's the part where every single dataframe ends up entirely in one row, and therefore one one executor).

dfs
    .toSeq
    .map{ case (name, df) => df
        .select(map(
             df.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))) : _*
        ) as "big_map")
        .withColumn("df", lit(name))}
    .reduce(_ union _)
    .groupBy("df")
    .agg(collect_list('big_map) as "column1")
    .show(false)
+---+-----------------------------------------------------------------------------------+
|df |column1                                                                            |
+---+-----------------------------------------------------------------------------------+
|df0|[{id -> 1, name -> aaa, age -> 23}, {id -> 2, name -> bbb, age -> 34}]             |
|df1|[{game -> ludo, time -> 10, score -> 20}, {game -> rummy, time -> 10, score -> 40}]|
+---+-----------------------------------------------------------------------------------+
Sign up to request clarification or add additional context in comments.

Comments

0

Here is a solution specific to your use-case:

import org.apache.spark.sql._

def sample(dfs : Map[String, DataFrame])(implicit spark: SparkSession): DataFrame =
  dfs
    .values
    .foldLeft(spark.emptyDataFrame)((acc, df) => acc.union(df))

The spark session is required to create the empty DataFrame accumulator to fold on.

Alternatively if you can guarantee the Map is non empty.

def sample(dfs : Map[String, DataFrame]): DataFrame =
  dfs
    .values
    .reduce((acc, df) => acc.union(df))

1 Comment

can you please check the question I have added an example now.Each input dataframe which I get has diff schemas so I want the data of an entire input dataframe to be populated as a column ,so my output dataframe has data of each input dataframe in one column

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.