As I'm new to Spark, I have a simple doubt: I have to create an empty DataFrame which I then have to populate later on, based on some conditions.

I have gone through many questions about creating an empty DataFrame, but what is the difference between the approaches below?

Here is what I have tried, though I don't know whether it's the right approach or not:

def function1(df: DataFrame): DataFrame = {
  // `x` comes from the surrounding scope (an entry of the Map[String, String] I iterate over)
  var newdf: DataFrame = null
  if (!x._2(0).column.trim.isEmpty) {
    newdf = spark.sql("SELECT f_name,l_name FROM tab1")
  } else {
    newdf = spark.sql("SELECT address,zipcode FROM tab1")
  }
  newdf
}

The above approach gives me no error when running locally; I don't know how it will behave on a cluster.
But I have found other approaches where an empty DataFrame is created with a specified schema, like this:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val my_schema = StructType(Seq(
  StructField("field1", StringType, nullable = false),
  StructField("field2", StringType, nullable = false)
))

val empty: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)

But my problem is that I don't have a predefined schema: the resulting DataFrame may have any schema, which is only determined at runtime, so I don't know in advance what the schema will look like.

Is there any problem if I go with approach 1, or is there anything I'm missing?

  • Welcome to SO! Can you share how/when you are required to populate the DataFrame? That way we will have more context for discussing how to handle the schema. Commented Jan 20, 2020 at 10:49
  • I have a Map[String,String] which I'm iterating over; if a value is empty, the if branch executes, otherwise the else branch. My map may contain any number of keys and values, and I want the final resulting DataFrame. Commented Jan 20, 2020 at 10:55
  • Edited the question; there was a typo. Commented Jan 20, 2020 at 10:57
  • What is the error you are facing? It's not clear from your question. Commented Jan 20, 2020 at 10:58
  • val df = spark.emptyDataFrame will create an empty DataFrame without specifying a schema (see the sketch below). Commented Jan 20, 2020 at 10:59
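
For reference, a minimal sketch of that last suggestion, assuming an active SparkSession is in scope as spark:

// spark.emptyDataFrame has zero rows and zero columns
val df = spark.emptyDataFrame
df.printSchema() // prints only "root", since there is no schema
println(df.count()) // 0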

1 Answer

Try to avoid the var-based syntax. DataFrames are immutable collections, and in Scala an if/else is an expression that returns a value, so you can create the DataFrame directly from the expression. Something like the following code:

def function2(df0: DataFrame)(spark: SparkSession): DataFrame = {
  // The whole if/else expression evaluates to a DataFrame, so no var is needed
  if (!x._2(0).column.trim.isEmpty) {
    spark.sql("SELECT f_name,l_name FROM tab1")
  } else {
    spark.sql("SELECT address,zipcode FROM tab1")
  }
}
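
Since you mention in the comments that you iterate over a Map[String,String] and want the last resulting DataFrame, here is a rough sketch of that loop written as a fold; chooseDf, lastDf and the SELECT statements are hypothetical placeholders for your own per-value logic:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: pick a query depending on the map value
def chooseDf(value: String)(implicit spark: SparkSession): DataFrame =
  if (value.trim.nonEmpty) spark.sql("SELECT f_name,l_name FROM tab1")
  else spark.sql("SELECT address,zipcode FROM tab1")

// Thread an Option[DataFrame] through the map entries; the last entry wins
def lastDf(m: Map[String, String])(implicit spark: SparkSession): Option[DataFrame] =
  m.values.foldLeft(Option.empty[DataFrame])((_, v) => Some(chooseDf(v)))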

You can also build a schema from a list of strings, where each element becomes a column name:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val columnNames: List[String] = List("column1", "column2")

// All dataframe columns are of type string; the fields must be wrapped in a StructType
val schema = StructType(columnNames.map(StructField(_, StringType, nullable = true)))

spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
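
A quick usage check of the block above, with the expected printSchema output shown as comments:

val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
empty.printSchema()
// root
//  |-- column1: string (nullable = true)
//  |-- column2: string (nullable = true)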

If you have a more complex use case, edit your question and add something more specific.
