
In Scala, I can create a single-row DataFrame from an in-memory string like so:

import sqlContext.implicits._
val stringAsList = List("buzz")
val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
df.show()

When df.show() runs, it outputs:

+-----+
| fizz|
+-----+
| buzz|
+-----+

Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:

List<String> stringAsList = new ArrayList<String>();
stringAsList.add("buzz");
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame df = sqlContext.createDataFrame(sparkContext
    .parallelize(stringAsList), StringType);
df.show();

...but I still seem to be coming up short. When df.show() executes, I get:

++
||
++
||
++

(An empty DataFrame.) So I ask: using the Java API, how do I read an in-memory string into a DataFrame that has only one row and one column, and also specify that column's name, so that df.show() is identical to the Scala output above?


4 Answers


I have created two examples for Spark 2, in case you need to upgrade:

Simple fizz/buzz (or foe/bar for the old generation :)):

    SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
            .getOrCreate();

    List<String> stringAsList = new ArrayList<>();
    stringAsList.add("bar");

    JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());

    JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String row) -> RowFactory.create(row));

    // Creates schema
    StructType schema = DataTypes.createStructType(
            new StructField[] { DataTypes.createStructField("foe", DataTypes.StringType, false) });

    Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
    df.show();
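For this single-column case, Spark 2.x also lets you skip the RDD and the explicit schema entirely by using an Encoder; a minimal sketch, assuming a local session:

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SingleColumnDF {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("single-column-df").master("local[*]").getOrCreate();

        // createDataset infers the schema from the encoder; the default column
        // is named "value", and toDF("fizz") renames it
        Dataset<Row> df = spark.createDataset(
                Collections.singletonList("buzz"), Encoders.STRING()).toDF("fizz");

        df.show();  // one row, one column named "fizz"
        spark.stop();
    }
}
```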

2x2 data:

    SparkSession spark = SparkSession.builder().appName("Build a DataFrame from Scratch").master("local[*]")
            .getOrCreate();

    List<String[]> stringAsList = new ArrayList<>();
    stringAsList.add(new String[] { "bar1.1", "bar2.1" });
    stringAsList.add(new String[] { "bar1.2", "bar2.2" });

    JavaSparkContext sparkContext = new JavaSparkContext(spark.sparkContext());

    JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList).map((String[] row) -> RowFactory.create(row));

    // Creates schema
    StructType schema = DataTypes
            .createStructType(new StructField[] { DataTypes.createStructField("foe1", DataTypes.StringType, false),
                    DataTypes.createStructField("foe2", DataTypes.StringType, false) });

    Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
    df.show();

Code can be downloaded from: https://github.com/jgperrin/net.jgp.labs.spark.


1 Comment

How can this be done for mixed data types? Say the values to create the DataFrame from are "bar1.1" (String) and 10 (Int)?

You can achieve this by converting the List into an RDD of Row and then creating a schema that carries the column name.

There might be other ways as well, it's just one of them.

List<String> stringAsList = new ArrayList<>();
stringAsList.add("buzz");

JavaRDD<Row> rowRDD = sparkContext.parallelize(stringAsList)
        .map((String row) -> RowFactory.create(row));

StructType schema = DataTypes.createStructType(
        new StructField[] { DataTypes.createStructField("fizz", DataTypes.StringType, false) });

DataFrame df = sqlContext.createDataFrame(rowRDD, schema).toDF();
df.show();

+----+
|fizz|
+----+
|buzz|
+----+


Building on what @jgp suggested, if you want to do this for mixed types, you can do:

List<Tuple2<Integer, Boolean>> mixedTypes = Arrays.asList(
                new Tuple2<>(1, false),
                new Tuple2<>(1, false),
                new Tuple2<>(1, false));

JavaRDD<Row> rowRDD = sparkContext.parallelize(mixedTypes).map(row -> RowFactory.create(row._1, row._2));

StructType mySchema = new StructType()
                .add("id", DataTypes.IntegerType, false)
                .add("flag", DataTypes.BooleanType, false);

Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, mySchema).toDF();

This might help with @jdk2588's question above.

2 Comments

I have stumbled over the same issue, and it seems that Spark MLlib v3.0.2 does not provide a sparkContext.parallelize() method that takes only a list as a parameter. So it would be good to know which version this code works with (this applies to the other replies, too).
Why do you call toDF() after createDataFrame? Isn't createDataFrame alone enough?

This post here provides a solution that doesn't go through sparkContext.parallelize(...): https://timepasstechies.com/create-spark-dataframe-java-list/
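The approach from that link can be sketched roughly as follows (a sketch, assuming Spark 2.x; the class name is illustrative): build a plain java.util.List<Row> and pass it straight to createDataFrame, with no RDD involved:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ListToDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("list-to-df").master("local[*]").getOrCreate();

        // No parallelize: createDataFrame also accepts a local java.util.List<Row>
        List<Row> rows = Arrays.asList(RowFactory.create("buzz"));
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("fizz", DataTypes.StringType, false) });

        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
        spark.stop();
    }
}
```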
