As I'm new to Spark, I have a simple doubt: I have to create an empty DataFrame which I then have to populate later on, based on some conditions.

I have gone through many questions about creating an empty DataFrame, but what is the difference between the approaches below?

Here is what I have tried, though I don't know whether it's the right approach or not:

def function1(df: DataFrame): DataFrame = {
  // `x` comes from the surrounding scope (an entry of the Map[String, String] I iterate over)
  var newdf: DataFrame = null
  if (!x._2(0).column.trim.isEmpty) {
    newdf = spark.sql("SELECT f_name,l_name FROM tab1")
  } else {
    newdf = spark.sql("SELECT address,zipcode FROM tab1")
  }
  newdf
}

The above approach gives me no error when running locally; I don't know how it will behave on a cluster.
But I have found other approaches where an empty DataFrame is created with a specified schema, like this:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val my_schema = StructType(Seq(
  StructField("field1", StringType, nullable = false),
  StructField("field2", StringType, nullable = false)
))

val empty: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)

But my problem is that I don't have a predefined schema: the resulting DataFrame may have any schema, which is only determined at runtime, so I don't know in advance what the schema will look like.

Is there any problem if I go with approach 1, or is there anything I'm missing?

  • Welcome to SO! Can you share how/when you are required to populate the DataFrame? That way we will have more context for discussing how to handle the schema. Commented Jan 20, 2020 at 10:49
  • I have a Map[String,String] which I'm iterating over; if a value is empty, the if branch executes, otherwise the else branch. My map may contain any number of keys and values, and I want the final resulting DataFrame. Commented Jan 20, 2020 at 10:55
  • Edited the question; there was a typo. Commented Jan 20, 2020 at 10:57
  • What is the error you are facing? It's not clear from your question. Commented Jan 20, 2020 at 10:58
  • val df = spark.emptyDataFrame will create an empty DataFrame without specifying a schema (see the sketch below). Commented Jan 20, 2020 at 10:59
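
For reference, a minimal sketch of that last suggestion, assuming an active SparkSession is in scope as spark:

// spark.emptyDataFrame has zero rows and zero columns
val df = spark.emptyDataFrame
df.printSchema() // prints only "root", since there is no schema
println(df.count()) // 0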

1 Answer

Try to avoid the var-based syntax. DataFrames are immutable collections, and in Scala an if/else is an expression that returns a value, so you can create the DataFrame directly from the expression. Something like the following code:

def function2(df0: DataFrame)(spark: SparkSession): DataFrame = {
  // The whole if/else expression evaluates to a DataFrame, so no var is needed
  if (!x._2(0).column.trim.isEmpty) {
    spark.sql("SELECT f_name,l_name FROM tab1")
  } else {
    spark.sql("SELECT address,zipcode FROM tab1")
  }
}
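
Since you mention in the comments that you iterate over a Map[String,String] and want the last resulting DataFrame, here is a rough sketch of that loop written as a fold; chooseDf, lastDf and the SELECT statements are hypothetical placeholders for your own per-value logic:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: pick a query depending on the map value
def chooseDf(value: String)(implicit spark: SparkSession): DataFrame =
  if (value.trim.nonEmpty) spark.sql("SELECT f_name,l_name FROM tab1")
  else spark.sql("SELECT address,zipcode FROM tab1")

// Thread an Option[DataFrame] through the map entries; the last entry wins
def lastDf(m: Map[String, String])(implicit spark: SparkSession): Option[DataFrame] =
  m.values.foldLeft(Option.empty[DataFrame])((_, v) => Some(chooseDf(v)))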

You can also build a schema from a list of strings, where each element becomes a column name:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val columnNames: List[String] = List("column1", "column2")

// All dataframe columns are of type string; the fields must be wrapped in a StructType
val schema = StructType(columnNames.map(StructField(_, StringType, nullable = true)))

spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
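
A quick usage check of the block above, with the expected printSchema output shown as comments:

val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
empty.printSchema()
// root
//  |-- column1: string (nullable = true)
//  |-- column2: string (nullable = true)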

If you have a more complex use case, edit your question and add something more specific.
