
There is a Hive table with a single column of type string.

hive> desc logical_control.test1;
OK
test_field_1          string                  test field 1

val df2 = spark.sql("select * from logical_control.test1")

df2.printSchema()
root
 |-- test_field_1: string (nullable = true)

df2.show(false)
+------------------------+
|test_field_1            |
+------------------------+
|[[str0], [str1], [str2]]|
+------------------------+

How to transform it to structured column like below?

root
 |-- A: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)

I tried to recover it using the initial schema the data had before it was written to HDFS, but json_data is null.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = StructType(
  Seq(
    StructField("A", ArrayType(
      StructType(
        Seq(
          StructField("S", StringType, nullable = true)
        )
      )
    ), nullable = true)
  )
)

val df3 = df2.withColumn("json_data", from_json(col("test_field_1"), schema))

df3.printSchema()
root
 |-- test_field_1: string (nullable = true)
 |-- json_data: struct (nullable = true)
 |    |-- A: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- S: string (nullable = true)

df3.show(false)
+------------------------+---------+
|test_field_1            |json_data|
+------------------------+---------+
|[[str0], [str1], [str2]]|null     |
+------------------------+---------+
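Note: from_json returns null when the string is not valid JSON for the supplied schema, which appears to be the case here: the stored value has no "A"/"S" keys and no quotes. For comparison, a string shaped like the following hypothetical example would parse with the schema above:

import org.apache.spark.sql.functions.{from_json, lit}

// Hypothetical well-formed input for the schema defined above.
val validJson = """{"A":[{"S":"str0"},{"S":"str1"},{"S":"str2"}]}"""
spark.range(1).select(from_json(lit(validJson), schema).as("json_data")).show(false)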
  • Could you add desc formatted logical_control.test1; to the question? Commented Mar 25, 2020 at 14:28
  • @Shu hive> desc logical_control.test1; OK test_field_1 string test field 1 Time taken: 0.673 seconds, Fetched: 1 row(s) Commented Mar 26, 2020 at 4:45

1 Answer


If the structure of test_field_1 is fixed and you don't mind "parsing" the field yourself, you can use a UDF to perform the transformation:

import org.apache.spark.sql.functions.{col, udf}

// Spark encodes Array[S] as array<struct<S:string>>, matching the target schema.
case class S(S: String)

// Strip the square brackets, split on commas, and wrap each trimmed token in S.
def toArray: String => Array[S] = _.replaceAll("[\\[\\]]", "").split(",").map(s => S(s.trim))
val toArrayUdf = udf(toArray)

val df3 = df2.withColumn("json_data", toArrayUdf(col("test_field_1")))
df3.printSchema()
df3.show(false)

prints

root
 |-- test_field_1: string (nullable = true)
 |-- json_data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)

+------------------------+------------------------+
|test_field_1            |json_data               |
+------------------------+------------------------+
|[[str0], [str1], [str2]]|[[str0], [str1], [str2]]|
+------------------------+------------------------+

The tricky part is creating the second level (element: struct) of the structure. I used the case class S to create this struct.
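To consume the recovered column, one option (a minimal sketch; the alias names are just for illustration) is to explode the array and select the nested field:

import org.apache.spark.sql.functions.{col, explode}

// One row per array element, then read the nested S field out of each struct.
df3.select(explode(col("json_data")).as("elem"))
  .select(col("elem.S").as("S"))
  .show(false)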


4 Comments

Is there a way to avoid parsing the field yourself?
I didn't find one
Let me ask about something beyond the initial question: maybe you have an idea about splitting similar structures into a few columns before writing to HDFS, or some other method that makes the data more convenient to use?
You could try to write the data to HDFS in a format that supports nested structures (like Parquet), so you don't have trouble reading the data back; see the sketch below.
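A minimal sketch of that suggestion, assuming a hypothetical HDFS path; Parquet stores the nested schema, so the structure survives the round trip without any string parsing:

import org.apache.spark.sql.functions.col

// Persist the structured column under its intended name; the path is hypothetical.
df3.select(col("json_data").as("A"))
  .write.mode("overwrite").parquet("hdfs:///tmp/logical_control/test1_structured")

// Reading it back restores array<struct<S:string>> directly, no UDF needed.
val restored = spark.read.parquet("hdfs:///tmp/logical_control/test1_structured")
restored.printSchema()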
