
I need to create a data frame in my test. I tried the code below:

StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);

List<String> nums = new ArrayList<String>();
nums.add("value1");
nums.add("value2");

Dataset<Row> df = spark.createDataFrame(nums, structType);

The expected result is:

 +------+------+
 |A     |B     |
 +------+------+
 |value1|value2|
 +------+------+

But it is not accepted by the compiler. How do I initialize a DataFrame/Dataset?

2 Answers


In Spark 3.0 and earlier, SparkSession has no method that creates a DataFrame from a list of arbitrary objects and a StructType.

However, there is a method that builds a DataFrame from a list of rows and a StructType. So to make your code work, change the type of nums from ArrayList<String> to ArrayList<Row>. You can do that with RowFactory:

// imports
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// code
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);

List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));

Dataset<Row> df = spark.createDataFrame(nums, structType);

// result
// +------+------+
// |A     |B     |
// +------+------+
// |value1|value2|
// +------+------+

If you want more rows in your dataframe, just add more Row objects to the list:

// code
...

List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
nums.add(RowFactory.create("value3", "value4"));

Dataset<Row> df = spark.createDataFrame(nums, structType);

// result
// +------+------+
// |A     |B     |
// +------+------+
// |value1|value2|
// |value3|value4|
// +------+------+
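As a side note, the same two-column schema can also be built in a single expression with the createStructType/createStructField helpers in org.apache.spark.sql.types.DataTypes; a sketch, using the same column names as above:

```java
import java.util.Arrays;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// equivalent to the two structType.add(...) calls above
StructType structType = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("A", DataTypes.StringType, false),
    DataTypes.createStructField("B", DataTypes.StringType, false)));
```

Either form produces the same StructType; pick whichever reads better to you.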




This is a cleaner way of doing things.

Step 1: Create a bean class for your custom type. Make sure it has public getters and setters, an all-args constructor, and implements Serializable:

public class StringWrapper implements Serializable {
  private String key;
  private String value;

  public StringWrapper(String key, String value) {
    this.key = key;
    this.value = value;
  }

  public String getKey() {
    return key;
  }

  public void setKey(String key) {
    this.key = key;
  }

  public String getValue() {
    return value;
  }

  public void setValue(String value) {
    this.value = value;
  }
}

Step 2: Generate data

List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));

Step 3: Convert it to an RDD

// javaSparkContext can be obtained via: new JavaSparkContext(sparkSession.sparkContext())
JavaRDD<StringWrapper> rdd = javaSparkContext.parallelize(nums);

Step 4: Convert it to a DataFrame

sparkSession.createDataFrame(rdd, StringWrapper.class).show(false);

Step 5: See the result

+------+------+
|key   |value |
+------+------+
|value1|value2|
+------+------+
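If you'd rather skip the RDD step, a typed Dataset can be built directly from the bean list with Encoders.bean and then converted to a DataFrame with toDF(). A sketch, assuming the same StringWrapper bean as above (plus a public no-argument constructor, which bean encoders typically require) and a sparkSession variable:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));

// createDataset infers the schema (key, value) from the bean's getters;
// StringWrapper is assumed to also have a public no-arg constructor
Dataset<Row> df = sparkSession
    .createDataset(nums, Encoders.bean(StringWrapper.class))
    .toDF();
df.show(false);
```

This avoids touching the SparkContext at all and keeps everything on the Dataset API.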

2 Comments

Thank you, but I have to initialize a DataFrame without any other class. Is there no other solution? (I edited my post with the expected result.)
Updated the answer ... you either create a bean class or you create a custom struct ... I feel the bean class is a lot cleaner.
