
There is a Hive table with a single column of type string.

hive> desc logical_control.test1;
OK
test_field_1          string                  test field 1

val df2 = spark.sql("select * from logical_control.test1")

df2.printSchema()
root
 |-- test_field_1: string (nullable = true)

df2.show(false)
+------------------------+
|test_field_1            |
+------------------------+
|[[str0], [str1], [str2]]|
+------------------------+

How to transform it to structured column like below?

root
 |-- A: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)

I tried to recover it using the initial schema the data had before it was written to HDFS, but json_data is null.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = StructType(
  Seq(
    StructField("A", ArrayType(
      StructType(
        Seq(
          StructField("S", StringType, nullable = true)
        )
      )
    ), nullable = true)
  )
)

val df3 = df2.withColumn("json_data", from_json(col("test_field_1"), schema))

df3.printSchema()
root
 |-- test_field_1: string (nullable = true)
 |-- json_data: struct (nullable = true)
 |    |-- A: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)

df3.show(false)
+------------------------+---------+
|test_field_1            |json_data|
+------------------------+---------+
|[[str0], [str1], [str2]]|null     |
+------------------------+---------+
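Note: from_json returns null when the string is not valid JSON for the supplied schema, which appears to be the case here: the stored value has no "A"/"S" keys and no quotes. For comparison, a string shaped like the following hypothetical example would parse with the schema above:

import org.apache.spark.sql.functions.{from_json, lit}

// Hypothetical well-formed input for the schema defined above.
val validJson = """{"A":[{"S":"str0"},{"S":"str1"},{"S":"str2"}]}"""
spark.range(1).select(from_json(lit(validJson), schema).as("json_data")).show(false)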
  • Could you add desc formatted logical_control.test1; to the question? Commented Mar 25, 2020 at 14:28
  • @Shu hive> desc logical_control.test1; OK test_field_1 string test field 1 Time taken: 0.673 seconds, Fetched: 1 row(s) Commented Mar 26, 2020 at 4:45

1 Answer


If the structure of test_field_1 is fixed and you don't mind "parsing" the field yourself, you can use a UDF to perform the transformation:

import org.apache.spark.sql.functions.{col, udf}

// Spark encodes Array[S] as array<struct<S:string>>, matching the target schema.
case class S(S: String)

// Strip the square brackets, split on commas, and wrap each trimmed token in S.
def toArray: String => Array[S] = _.replaceAll("[\\[\\]]", "").split(",").map(s => S(s.trim))
val toArrayUdf = udf(toArray)

val df3 = df2.withColumn("json_data", toArrayUdf(col("test_field_1")))
df3.printSchema()
df3.show(false)

prints

root
 |-- test_field_1: string (nullable = true)
 |-- json_data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)

+------------------------+------------------------+
|test_field_1            |json_data               |
+------------------------+------------------------+
|[[str0], [str1], [str2]]|[[str0], [str1], [str2]]|
+------------------------+------------------------+

The tricky part is creating the second level (element: struct) of the structure. I used the case class S to create this struct.
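To consume the recovered column, one option (a minimal sketch; the alias names are just for illustration) is to explode the array and select the nested field:

import org.apache.spark.sql.functions.{col, explode}

// One row per array element, then read the nested S field out of each struct.
df3.select(explode(col("json_data")).as("elem"))
  .select(col("elem.S").as("S"))
  .show(false)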


4 Comments

Is there a way to avoid parsing the field yourself?
I didn't find one
Let me ask about something beyond the initial question: maybe you have an idea about splitting similar structures into a few columns before writing to HDFS, or some other method that makes the data more convenient to use?
You could try to write the data to HDFS in a format that supports nested structures (like Parquet), so you don't have trouble reading the data back; see the sketch below.
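A minimal sketch of that suggestion, assuming a hypothetical HDFS path; Parquet stores the nested schema, so the structure survives the round trip without any string parsing:

import org.apache.spark.sql.functions.col

// Persist the structured column under its intended name; the path is hypothetical.
df3.select(col("json_data").as("A"))
  .write.mode("overwrite").parquet("hdfs:///tmp/logical_control/test1_structured")

// Reading it back restores array<struct<S:string>> directly, no UDF needed.
val restored = spark.read.parquet("hdfs:///tmp/logical_control/test1_structured")
restored.printSchema()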
