1

I've a column in a Hive table:

Column Name: Filters

Data Type:

 |-- filters: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to get the value from this column by it's corresponding name.

What I did so far:

val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'")

val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect()

Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_]))

Note: Casting to Seq because WrappedArray to Map conversion is not possible.

What to do next?

1 Answer 1

1
I want to get the value from this column by it's corresponding name.

If you want simple and reliable way to get all values by name, you may flatten your table using explode and filter:

case class Data(name: String, value: String)
case class Filters(filters: Array[Data])

val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c")))))
df.show()
+--------------+
|       filters|
+--------------+
|[[a,b], [a,c]]|
|       [[b,c]]|
+--------------+

df.withColumn("filter", explode($"filters"))
  .select($"filter.name" as "name", $"filter.value" as "value")
  .where($"name" === "a")
  .show()
+----+-----+
|name|value|
+----+-----+
|   a|    b|
|   a|    c|
+----+-----+

You can also collect your data any way you want:

val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value")
flatDf.rdd.map(r => Array(r(0), r(1))).collect()
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c))
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not  the best idea, if you have many values per key
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c)))

If you want to cast array[struct] to map[string, string] for future saving to some storage - it's different story, and this case is better solved by UDF. Anyway, you have to avoid collect() as long as it possible to keep your code scalable.

Sign up to request clarification or add additional context in comments.

1 Comment

I want to get the values b and c as a Sequence or Array of String not as Row object.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.