
I need to create a data frame in my test. I tried the code below:

StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);

List<String> nums = new ArrayList<String>();
nums.add("value1");
nums.add("value2");

Dataset<Row> df = spark.createDataFrame(nums, structType);

The expected result is:

 +------+------+
 |A     |B     |
 +------+------+
 |value1|value2|
 +------+------+

But it is not accepted by the compiler. How do I initialize a DataFrame/Dataset?

2 Answers


In Spark 3.0 and earlier, SparkSession has no method that creates a DataFrame from a list of arbitrary objects and a StructType.

However, there is a method that builds a DataFrame from a list of rows and a StructType. So to make your code work, change the type of nums from ArrayList<String> to ArrayList<Row>. You can do that with RowFactory:

// imports
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// code
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);

List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));

Dataset<Row> df = spark.createDataFrame(nums, structType);

// result
// +------+------+
// |A     |B     |
// +------+------+
// |value1|value2|
// +------+------+

If you want more rows in your dataframe, just add more Row objects to the list:

// code
...

List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
nums.add(RowFactory.create("value3", "value4"));

Dataset<Row> df = spark.createDataFrame(nums, structType);

// result
// +------+------+
// |A     |B     |
// +------+------+
// |value1|value2|
// |value3|value4|
// +------+------+
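As a side note, the same two-column schema can also be built in a single expression with the createStructType/createStructField helpers in org.apache.spark.sql.types.DataTypes; a sketch, using the same column names as above:

```java
import java.util.Arrays;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// equivalent to the two structType.add(...) calls above
StructType structType = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("A", DataTypes.StringType, false),
    DataTypes.createStructField("B", DataTypes.StringType, false)));
```

Either form produces the same StructType; pick whichever reads better to you.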




This is a cleaner way of doing things.

Step 1: Create a bean class for your custom type. Make sure it has public getters and setters, an all-args constructor, and implements Serializable:

public class StringWrapper implements Serializable {
  private String key;
  private String value;

  public StringWrapper(String key, String value) {
    this.key = key;
    this.value = value;
  }

  public String getKey() {
    return key;
  }

  public void setKey(String key) {
    this.key = key;
  }

  public String getValue() {
    return value;
  }

  public void setValue(String value) {
    this.value = value;
  }
}

Step 2: Generate data

List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));

Step 3: Convert it to an RDD

// javaSparkContext can be obtained via: new JavaSparkContext(sparkSession.sparkContext())
JavaRDD<StringWrapper> rdd = javaSparkContext.parallelize(nums);

Step 4: Convert it to a DataFrame

sparkSession.createDataFrame(rdd, StringWrapper.class).show(false);

Step 5: See the result

+------+------+
|key   |value |
+------+------+
|value1|value2|
+------+------+
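If you'd rather skip the RDD step, a typed Dataset can be built directly from the bean list with Encoders.bean and then converted to a DataFrame with toDF(). A sketch, assuming the same StringWrapper bean as above (plus a public no-argument constructor, which bean encoders typically require) and a sparkSession variable:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));

// createDataset infers the schema (key, value) from the bean's getters;
// StringWrapper is assumed to also have a public no-arg constructor
Dataset<Row> df = sparkSession
    .createDataset(nums, Encoders.bean(StringWrapper.class))
    .toDF();
df.show(false);
```

This avoids touching the SparkContext at all and keeps everything on the Dataset API.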

2 Comments

Thank you, but I have to initialize a DataFrame without any other class. Is there no other solution? (I edited my post with the expected result.)
Updated the answer ... you either create a bean class or you create a custom struct ... I feel the bean class is a lot cleaner.
