How to cast a string to array of struct in HiveQL

Question

I have a Hive table with a column "periode" whose type is string.

The column has values like the following:

[{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
[{periode:20161212-20161217,nb:0}]

I want to cast this column to array<struct<periode:string, nb:int>>. The final goal is to have one row per periode. For this I want to use LATERAL VIEW with explode on the column periode, which is why I want to convert it to array<struct<periode:string, nb:int>>.

Thanks for help. Sidi

It is unclear what you want the final result to be.

– o-90, Commented Jan 24, 2017 at 21:06

o-90 · Accepted Answer · 2017-01-25 02:18:27Z

1

You don't need to "cast" anything, you just need to explode the array and then unpack the struct. I added an index to your data to make it more clear where things are ending up.

Data:

idx arr_of_structs
0   [{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
1   [{periode:20161212-20161217,nb:0}]

Query:

SELECT idx                          -- index
  , my_struct.periode AS periode    -- unpacks periode
  , my_struct.nb      AS nb         -- unpacks nb
FROM database.table
LATERAL VIEW EXPLODE(arr_of_structs) exptbl AS my_struct

Output:

idx     periode                 nb
0       20160118-20160205       1
0       20161130-20161130       1
0       20161130-20161221       1
1       20161212-20161217       0
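
For anyone who wants to reproduce the query above, here is a minimal sketch of a table whose column really is an array of structs. The table name demo_periodes is hypothetical, and the SELECT without a FROM clause assumes a reasonably recent Hive version (older versions need a one-row dummy table).

```sql
-- Hypothetical demo table; the column type is what the query above assumes.
CREATE TABLE demo_periodes (
  idx            INT,
  arr_of_structs ARRAY<STRUCT<periode:STRING, nb:INT>>
);

-- INSERT ... VALUES does not accept complex types, so the row is built with SELECT.
INSERT INTO TABLE demo_periodes
SELECT 1, array(named_struct('periode', '20161212-20161217', 'nb', 0));

-- Same explode-and-unpack pattern as in the answer.
SELECT idx
  , my_struct.periode AS periode
  , my_struct.nb      AS nb
FROM demo_periodes
LATERAL VIEW EXPLODE(arr_of_structs) exptbl AS my_struct;
```

With a STRING column, the EXPLODE call is rejected because explode only accepts arrays and maps.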

It's a bit unclear from your question what the desired result is, but as soon as you update it I'll modify the query accordingly.

EDIT:

The above solution only works if the column is already an array of structs. I didn't catch that your input is a STRING.

Query:

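-- tmp_arr[0] is assumed to be the periode token and tmp_arr[1] the nb token;
-- COLLECT_SET does not guarantee that order.
SELECT REGEXP_EXTRACT(tmp_arr[0], "([0-9]{8}-[0-9]{8})") AS periode
  , REGEXP_EXTRACT(tmp_arr[1], ":([0-9]*)")              AS nb
FROM (
  SELECT idx
    , pos
    , COLLECT_SET(tmp_col) AS tmp_arr
  FROM (
    SELECT idx
      , tmp_col
      -- map positions (0,1), (2,3), ... to one key so each periode/nb pair groups together
      , CASE WHEN PMOD(pos, 2) = 0 THEN pos+1 ELSE pos END AS pos
    FROM (
      SELECT *
        -- synthetic row id so pairs from different rows stay separate
        , ROW_NUMBER() OVER () AS idx
      FROM database.table ) x
    -- splitting on ',' yields alternating periode and nb tokens
    LATERAL VIEW POSEXPLODE(SPLIT(periode, ',')) exptbl AS pos, tmp_col ) y
  GROUP BY idx, pos) z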

Output:

periode                 nb
20160118-20160205       1
20161130-20161130       1
20161130-20161221       1
20161212-20161217       0
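
A possible variant, sketched here under assumptions: strip the outer brackets, split the string on the "},{" separators so each array element becomes one token, and pull both fields out of the same token. This avoids the window function and the GROUP BY, and does not depend on the order of a collected set. The table name my_table is hypothetical, and the patterns assume the exact format shown in the question (no spaces, 8-digit dates, integer nb).

```sql
-- Hypothetical table my_table with the STRING column periode from the question.
SELECT
    regexp_extract(item, 'periode:([0-9]{8}-[0-9]{8})', 1) AS periode,
    CAST(regexp_extract(item, 'nb:([0-9]+)', 1) AS INT)    AS nb
FROM my_table
LATERAL VIEW explode(
    -- remove the leading "[{" and trailing "}]", then split on "},{"
    split(regexp_replace(periode, '^\\[\\{|\\}\\]$', ''), '\\},\\{')
) items AS item;
```

Each exploded item then looks like periode:20160118-20160205,nb:1, so one row comes out per periode.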

edited Jan 25, 2017 at 2:18

answered Jan 24, 2017 at 21:18

o-90


3 Comments

user 923227 Over a year ago

My JSON is large, about 600+ MB compressed, and REGEXP_EXTRACT fails with an out-of-memory error!

o-90 Over a year ago

@SumitKumarGhosh is 600 MB large?

user 923227 Over a year ago

Yes, we thought the same thing, but the job hits heap errors when it runs and we cannot process the files. 600 MB is the maximum compressed size of a single JSON file. It needs to be split, but it seems to be loaded as one string, hence the failures!

hlagos · Accepted Answer · 2017-01-24 23:46:04Z

0

What about using the split function? You should be able to do something like:

select nb, period from 
(select split(periode, "-") as periods, nb from yourtable) t
LATERAL VIEW explode(periods) sss AS period;

I haven't tried it, but it should work :)

EDIT: The above should work if you have a column periodes following the pattern date-date-date... and a separate column nb, but it looks like that isn't the case here. The following query should work for you (verbose, but it works):

select period, nb
from (
  select
    regexp_replace(split(split(tok1, ",")[1], ":")[1], "[\\]|}]", "") as nb,
    split(split(split(tok1, ",")[0], ":")[1], "-")                    as periods
  from (select split(YOURSTRINGCOLUMN, "},") as s1 from YOURTABLE) r1
  LATERAL VIEW explode(s1) ss1 AS tok1
) r2
LATERAL VIEW explode(periods) ss2 AS period;

edited Jan 24, 2017 at 23:46

answered Jan 24, 2017 at 20:59

hlagos


5 Comments

o-90 Over a year ago

SPLIT(periode, "-") makes no sense; periode is a column name inside an array of structs.

hlagos Over a year ago

"I have a hive table with the column "periode", the type of the column is string." He has two columns.

o-90 Over a year ago

No, this won't work. The OP specifically states their data is of type array<struct<periode:string, nb:int>>. You cannot call regexp_replace() or split() on an array.

hlagos Over a year ago

He doesn't have an array. His data is all one string; you are assuming it is an array because the data looks like one. He is saying that he would like to cast the string to a structure. Wait for more clarification if you need it :)

hlagos Over a year ago

No problem. Your solution is cleaner but much less efficient on a lot of data. The solution without window functions doesn't need a reduce phase, which is an advantage if he is working with a lot of data.

misterte · Accepted Answer · 2018-02-05 22:24:35Z

0

I realize this question is a year old, but I ran into this same issue and tackled it by using the json_split UDF from Brickhouse.

SELECT EXPLODE(
    json_split(
        '[{"periode":"20160118-20160205","nb":1},{"periode":"20161130-20161130","nb":1},{"periode":"20161130-20161221","nb":1}]'
));

col
{"periode":"20160118-20160205","nb":1}
{"periode":"20161130-20161130","nb":1}
{"periode":"20161130-20161221","nb":1}
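
To get from those JSON strings to one column per field, a sketch along these lines might help. It assumes json_split is already registered as in the example above and uses Hive's built-in get_json_object. Like the example above, it also assumes the input is valid JSON with quoted keys and string values, which the unquoted strings in the question are not.

```sql
SELECT
    get_json_object(item, '$.periode')         AS periode,
    CAST(get_json_object(item, '$.nb') AS INT) AS nb
FROM (
    -- json_split is the Brickhouse UDF; it is assumed to return one JSON string per array element
    SELECT json_split(
        '[{"periode":"20160118-20160205","nb":1},{"periode":"20161212-20161217","nb":0}]'
    ) AS items
) t
LATERAL VIEW explode(items) e AS item;
```

get_json_object returns strings, hence the CAST for nb.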

Sorry for the spaghetti code.

There's also a similar question here using JSON arrays instead of JSON strings. It's not the same case, but for anyone facing this kind of task it might be useful in a bigger context.

answered Feb 5, 2018 at 22:24

misterte
