
In Scala, I can create a single-row DataFrame from an in-memory string like so:

import sqlContext.implicits._
val stringAsList = List("buzz")
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()

When df.show() runs, it outputs:

+-----+
| fizz|
+-----+
| buzz|
+-----+

Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:

List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
    .parallelize(stringAsList), StringType);
df.show();

...but I still seem to be coming up short. When df.show() executes, I get:

++
||
++
||
++

(An empty DataFrame.) So I ask: using the Java API, how do I read an in-memory string into a DataFrame that has only one row and one column, and also specify that column's name, so that df.show() is identical to the Scala output above?


4 Answers


I have created two examples for Spark 2, in case you need to upgrade:

Simple fizz/buzz (or foe/bar for the old generation :)):

    SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
            .getOrCreate();

    List<String> stringAsList = new ArrayList<>();
    stringAsList.add("bar");

    JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());

    JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> RowFactory.create(row));

    // Creates schema
    StructType schema = DataTypes.createStructType(
            new StructField[] { DataTypes.createStructField("foe", DataTypes.StringType, false) });

    Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
    df.show();
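For this single-column case, Spark 2.x also lets you skip the RDD and the explicit schema entirely by using an Encoder; a minimal sketch, assuming a local session:

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SingleColumnDF {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("single-column-df").master("local[*]").getOrCreate();

        // createDataset infers the schema from the encoder; the default column
        // is named "value", and toDF("fizz") renames it
        Dataset<Row> df = spark.createDataset(
                Collections.singletonList("buzz"), Encoders.STRING()).toDF("fizz");

        df.show();  // one row, one column named "fizz"
        spark.stop();
    }
}
```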

2x2 data:

    SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
            .getOrCreate();

    List<String[]> stringAsList = new ArrayList<>();
    stringAsList.add(new String[] { "bar1.1", "bar2.1" });
    stringAsList.add(new String[] { "bar1.2", "bar2.2" });

    JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());

    JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String[] row) -> RowFactory.create(row));

    // Creates schema
    StructType schema = DataTypes
            .createStructType(new StructField[] { DataTypes.createStructField("foe1", DataTypes.StringType, false),
                    DataTypes.createStructField("foe2", DataTypes.StringType, false) });

    Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
    df.show();

Code can be downloaded from: https://github.com/jgperrin/net.jgp.labs.spark.


1 Comment

How can this be done for mixed data types? Say the values to create the DataFrame from are "bar1.1" (String) and 10 (Int)?

You can achieve this by converting the List into an RDD of Row and then creating a schema that carries the column name.

There might be other ways as well, it's just one of them.

List<String> stringAsList = new ArrayList<>();
stringAsList.add("buzz");

JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList)
        .map((String row) -> RowFactory.create(row));

StructType schema = DataTypes.createStructType(
        new StructField[] { DataTypes.createStructField("fizz", DataTypes.StringType, false) });

DataFrame df = sqlContext.createDataFrame(rowRDD, schema).toDF();
df.show();

+----+
|fizz|
+----+
|buzz|
+----+


Building on what @jgp suggested, if you want to do this for mixed types, you can do:

List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
                new Tuple2<>(1, false),
                new Tuple2<>(1, false),
                new Tuple2<>(1, false));

JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes).map(row -> RowFactory.create(row._1, row._2));

StructType mySchema = new StructType()
                .add("id", DataTypes.IntegerType, false)
                .add("flag", DataTypes.BooleanType, false);

Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, mySchema).toDF();

This might help with @jdk2588's question above.

2 Comments

I have stumbled over the same issue, and it seems that Spark MLlib v3.0.2 does not provide a sparkContext.parallelize() method that takes only a list as a parameter. So it would be good to know which version this code works with (this applies to the other replies, too).
Why do you call toDF() after createDataFrame? Isn't createDataFrame alone enough?

This post here provides a solution that doesn't go through sparkContext.parallelize(...): https://timepasstechies.com/create-spark-dataframe-java-list/
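The approach from that link can be sketched roughly as follows (a sketch, assuming Spark 2.x; the class name is illustrative): build a plain java.util.List<Row> and pass it straight to createDataFrame, with no RDD involved:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ListToDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("list-to-df").master("local[*]").getOrCreate();

        // No parallelize: createDataFrame also accepts a local java.util.List<Row>
        List<Row> rows = Arrays.asList(RowFactory.create("buzz"));
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("fizz", DataTypes.StringType, false) });

        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
        spark.stop();
    }
}
```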
