I am loading MongoDB data into a Hive table and trying to fix the "Unsupported NullType" error raised by saveAsTable. Sample data schema:

root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: null (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: null (nullable = true)

I tried functions.lit, like this:

df = df.withColumn("level1.level2.level3_1", functions.lit("null").cast("string"))
       .withColumn("level1.level2.level3_2.level4", functions.lit("null").cast("string"));

but the result adds new top-level columns instead of changing the nested fields:

root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: null (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: null (nullable = true)
 |-- level1.level2.level3_1: string (nullable = false)
 |-- level1.level2.level3_2.level4: string (nullable = false)

I also checked df.na().fill(), but it does not seem to change the schema.

The desired result is

root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: string (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: string (nullable = true)

so that I can save the loaded MongoDB data as a table in Hive.

Has anyone worked on this and could advise me on how to cast a nested NullType, or how to deal with NullType in Java? I am looking for a systematic, general solution that scales to more complex data. Many thanks.

1 Answer

One idea is to create a schema that uses StringType in place of the null fields and read the data with that schema.

StructType schema = createStructType(Arrays.asList(
    createStructField("level1", createStructType(Arrays.asList(
        createStructField("level2", createStructType(Arrays.asList(
            createStructField("level3_1", StringType, true),
            createStructField("level3_2", createArrayType(createStructType(Arrays.asList(
                createStructField("level4", StringType, true)))), true)
            )), true))), true)));

// Replace new ArrayList<>() with your own data.
Dataset<Row> df = ss.createDataFrame(new ArrayList<>(), schema);
df.printSchema();
root
 |-- level1: struct (nullable = true)
 |    |-- level2: struct (nullable = true)
 |    |    |-- level3_1: string (nullable = true)
 |    |    |-- level3_2: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- level4: string (nullable = true)


EDIT:

I added a more intuitive example here to convey the idea. I hope it helps.

// Assumes static imports of org.apache.spark.sql.functions.lit and
// org.apache.spark.sql.types.DataTypes.* (createStructType, createStructField, LongType, StringType).
@Test
public void test() {
    SparkSession ss = SparkSession.builder().master("local").appName("test").getOrCreate();

    // Step 1) Read your MongoDB data. (A NullType field 'level' is added manually here for explanation.)
    // https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/
    Dataset<Row> data = ss.read().json("test.json").withColumn("level", lit(null));
    data.printSchema();

    StructType schema = createStructType(Arrays.asList(
        createStructField("_id", LongType, true),
        createStructField("level", StringType, true)));

    // Step 2) Create newData using the schema you defined.
    // Note: collectAsList() brings every row to the driver, so this only suits data that fits in driver memory.
    Dataset<Row> newData = ss.createDataFrame(data.collectAsList(), schema);
    newData.printSchema();

    // Step 3) Load newData into Hive.
}
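
For a more complex schema, writing the StructType by hand gets tedious. A possible alternative, shown here only as a minimal sketch, is to walk the existing schema recursively, swap every NullType for StringType, and cast each top-level column to the rewritten type. The class name NullTypeFixer and its two methods are hypothetical helpers, not part of Spark, and the sketch assumes that Spark accepts a cast from NullType to string inside nested structs and arrays, which is worth verifying on your Spark version.

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

// Hypothetical helper (not a Spark class): replaces NullType anywhere in a schema.
public final class NullTypeFixer {

    private NullTypeFixer() {
    }

    /** Returns a copy of the given type with every NullType replaced by StringType. */
    public static DataType replaceNullType(DataType type) {
        if (DataTypes.NullType.equals(type)) {
            return DataTypes.StringType;
        }
        if (type instanceof StructType) {
            StructField[] fields = ((StructType) type).fields();
            StructField[] fixed = new StructField[fields.length];
            for (int i = 0; i < fields.length; i++) {
                StructField f = fields[i];
                // Keep name, nullability and metadata; only the data type may change.
                fixed[i] = DataTypes.createStructField(
                    f.name(), replaceNullType(f.dataType()), f.nullable(), f.metadata());
            }
            return DataTypes.createStructType(fixed);
        }
        if (type instanceof ArrayType) {
            ArrayType a = (ArrayType) type;
            return DataTypes.createArrayType(replaceNullType(a.elementType()), a.containsNull());
        }
        if (type instanceof MapType) {
            MapType m = (MapType) type;
            return DataTypes.createMapType(
                replaceNullType(m.keyType()), replaceNullType(m.valueType()), m.valueContainsNull());
        }
        // Any other type (string, long, ...) is returned unchanged.
        return type;
    }

    /** Casts every top-level column to its NullType-free counterpart. */
    public static Dataset<Row> castNullTypes(Dataset<Row> df) {
        StructField[] fields = df.schema().fields();
        Column[] columns = new Column[fields.length];
        for (int i = 0; i < fields.length; i++) {
            StructField f = fields[i];
            // Backticks keep a column name that contains dots from being read as a nested path.
            columns[i] = col("`" + f.name() + "`")
                .cast(replaceNullType(f.dataType()))
                .as(f.name());
        }
        return df.select(columns);
    }
}
```

With this, something like NullTypeFixer.castNullTypes(data) would take the place of Step 2, and the data stays distributed instead of being collected to the driver. The values in the former NullType fields remain null; only their declared type changes.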

4 Comments

Thank you. What if I want to keep the data? This changes the schema, but the goal is to load the MongoDB data and save it as a table in Hive.
And what if the schema is complex? Is there a way to do this systematically?
Yes, it does not look like a good approach if the schema is complex, but I think this is a problem with using Spark from Java. With Scala or Python you don't need to create the schema by hand and can use easier libraries to do this.
I very much appreciate your example. This is very valuable information for Java users. What if we go the other way around: map over the rows and change any row that contains a null? Do you think that approach would work? If you are familiar with mapping rows, that would be ideal.