
Using Spark 2.1

I created a Dataset with a MapType column:

    StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("words", DataTypes.StringType, false, Metadata.empty()),
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("features", DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType), false, Metadata.empty())
    });

    Map<String, Integer> abc = new HashMap<String, Integer>();
    abc.put("abc", 1);
    Row r = RowFactory.create(0, "Hi these are words ", 1, abc);
    List<Row> data = Arrays.asList(r);
    Dataset<Row> wordDataFrame = spark.createDataFrame(data, schema);
    wordDataFrame.show();

The code above works fine.

But when I call a map function on this Dataset (to replace the MapType column with a new HashMap<String, Integer>), I get the error below.

    StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("words", DataTypes.StringType, false, Metadata.empty()),
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("featuresNew", DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType), false, Metadata.empty())
    });

    ExpressionEncoder<Row> encoder = RowEncoder.apply(schema);

    Dataset<Row> output = input.map(new MapFunction<Row, Row>() {
        @Override
        public Row call(Row row) throws Exception {
            Map<String, Integer> newMap = new HashMap<String, Integer>();
            newMap.put("Transformed string", 1);
            return RowFactory.create(row.getInt(0), row.getString(1), row.getInt(2), newMap);
        }
    }, encoder);

    return output;

Error Stack:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.RuntimeException: java.util.HashMap is not a valid external type for schema of map<string,int>
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:410)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745) 

What am I missing here? Why do I get the error "java.util.HashMap is not a valid external type for schema of map<string,int>"?

Edit:

I tried a java.util.List column:

    StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("words", DataTypes.StringType, false, Metadata.empty()),
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("featuresNew", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
    });

    ExpressionEncoder<Row> encoder = RowEncoder.apply(schema);

    Dataset<Row> output = input.map(new MapFunction<Row, Row>() {
        @Override
        public Row call(Row row) throws Exception {
            List<String> xyz = Arrays.asList("Hi", "how", "now");
            return RowFactory.create(row.getInt(0), row.getString(1), row.getInt(2), xyz);
        }
    }, encoder);

I get a similar error message:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.RuntimeException: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:221)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

java.lang.String works fine:

    StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("words", DataTypes.StringType, false, Metadata.empty()),
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("featuresNew", DataTypes.StringType, false, Metadata.empty())
    });

    ExpressionEncoder<Row> encoder = RowEncoder.apply(schema);

    Dataset<Row> output = input.map(new MapFunction<Row, Row>() {
        @Override
        public Row call(Row row) throws Exception {
            String xyz = Arrays.asList("Please", "work", "now").toString();
            return RowFactory.create(row.getInt(0), row.getString(1), row.getInt(2), xyz);
        }
    }, encoder);

Looks like primitive DataTypes work fine!

  • Good work, but you should post your answer as an answer to your own question. Then you can accept it! (Commented Sep 19, 2018 at 20:49)

2 Answers


If you look at row.getMap(3), it returns a scala.collection.Map:

    scala.collection.Map<Object, Object> map = row.getMap(3);

So it seems you need to use scala.collection.JavaConverters:

    JavaConverters.mapAsScalaMapConverter(newMap).asScala();
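
For context, a minimal sketch of how that conversion slots into the map call from the question (input, encoder, and the featuresNew schema are assumed to be the ones defined there):

    import scala.collection.JavaConverters;

    Dataset<Row> output = input.map(new MapFunction<Row, Row>() {
        @Override
        public Row call(Row row) throws Exception {
            Map<String, Integer> newMap = new HashMap<String, Integer>();
            newMap.put("Transformed string", 1);
            // Build the Row with a scala.collection.Map, which is the external
            // type the RowEncoder expects for a map<string,int> column
            return RowFactory.create(row.getInt(0), row.getString(1), row.getInt(2),
                    JavaConverters.mapAsScalaMapConverter(newMap).asScala());
        }
    }, encoder);

The array<string> case from the edit presumably fails for the same reason; JavaConverters.asScalaBufferConverter(xyz).asScala() should produce the scala.collection.Seq the encoder expects there.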

The following was actually found by the asker; I extracted it from the question so that others may find the answer in its proper place:

Solution: This worked for me.

I used the approach from "Converting Java HashMap to Scala Map" and changed the code as follows:

    StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("words", DataTypes.StringType, false, Metadata.empty()),
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("featuresNew", DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType), false, Metadata.empty())
    });

    ExpressionEncoder<Row> encoder = RowEncoder.apply(schema);

    Dataset<Row> output = input.map(new MapFunction<Row, Row>() {
        @Override
        public Row call(Row row) throws Exception {
            HashMap<String, Integer> newMap = new HashMap<String, Integer>();
            newMap.put("Transformed string", 1);
            return RowFactory.create(row.getInt(0), row.getString(1), row.getInt(2), ToScalaExample.toScalaMap(newMap));
        }
    }, encoder);

    return output;

I think that for primitive DataTypes, Spark implicitly converts the Java types to Scala types; for the others, we need to convert them explicitly.
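
The ToScalaExample.toScalaMap helper is referenced above but never shown; here is a plausible sketch, assuming it follows the linked "Converting Java HashMap to Scala Map" approach (the body is a reconstruction, not the asker's code):

    import scala.Predef;
    import scala.Tuple2;
    import scala.collection.JavaConverters;

    public class ToScalaExample {
        // Converts a java.util.Map into an immutable scala.collection.immutable.Map
        public static <K, V> scala.collection.immutable.Map<K, V> toScalaMap(java.util.Map<K, V> m) {
            return JavaConverters.mapAsScalaMapConverter(m).asScala().toMap(
                    Predef.<Tuple2<K, V>>conforms());
        }
    }

Strictly speaking, the mutable map returned by asScala() alone would already satisfy the encoder (the external type is scala.collection.Map); the toMap call just makes the result immutable, matching the linked example.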

