4

I have a hive table with the column "periode", the type of the column is string.

The column have values like the following:

[{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
[{periode:20161212-20161217,nb:0}]

I want to cast this column in array<struct<periode:string, nb:int>>. The final goal is to have one raw by periode. For this I want to use lateral view with explode on the column periode. That's why I want to convert it to array<struct<string, int>>

Thanks for help. Sidi

1
  • It is unclear what you want the final result to be. Commented Jan 24, 2017 at 21:06

3 Answers 3

1

You don't need to "cast" anything, you just need to explode the array and then unpack the struct. I added an index to your data to make it more clear where things are ending up.

Data:

idx arr_of_structs
0   [{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
1   [{periode:20161212-20161217,nb:0}]

Query:

SELECT idx                          -- index
  , my_struct.periode AS periode    -- unpacks periode
  , my_struct.nb      AS nb         -- unpacks nb
FROM database.table
LATERAL VIEW EXPLODE(arr_of_structs) exptbl AS my_struct

Output:

idx     periode                 nb
0       20160118-20160205       1
0       20161130-20161130       1
0       20161130-20161221       1
1       20161212-20161217       0

It's a bit unclear from your question what the desired result is, but as soon as you update it I'll modify the query accordingly.


EDIT:

The above solution is incorrect, I didn't catch that your input is a STRING.

Query:

SELECT REGEXP_EXTRACT(tmp_arr[0], "([0-9]{8}-[0-9]{8})") AS periode
  , REGEXP_EXTRACT(tmp_arr[1], ":([0-9]*)")              AS nb
FROM (
  SELECT idx
    , pos
    , COLLECT_SET(tmp_col) AS tmp_arr
  FROM (
    SELECT idx
      , tmp_col
      , CASE WHEN PMOD(pos, 2) = 0 THEN pos+1 ELSE pos END AS pos
    FROM (
      SELECT *
        , ROW_NUMBER() OVER () AS idx
      FROM database.table ) x
    LATERAL VIEW POSEXPLODE(SPLIT(periode, ',')) exptbl AS pos, tmp_col ) y
  GROUP BY idx, pos) z

Output:

periode                 nb
20160118-20160205       1
20161130-20161130       1
20161130-20161221       1
20161212-20161217       0    
Sign up to request clarification or add additional context in comments.

3 Comments

My Json is large - about compressed 600+ MB REGEXP_EXTRACT fails out of memory!
@SumitKumarGhosh is 600 MB large?
Yes, we too thought the same thing but when the job is running it is getting Heap Errors. And we are not able to process the files. 600 MB is max size of one json compressed - this needs to be split - so this seems to be loaded as one String. Hence the failures!
0

What about use the split function? you should be able to do something like

select nb, period from 
(select split(periode, "-") as periods, nb from yourtable) t
LATERAL VIEW explode(periods) sss AS period;

I didnt tried but it should work :)

EDIT: the above should work if you have a column periodes following a pattern date-date-date.. and a column nb, but it looks like that it isn't the case here. The following query should work for you (verbose but work)

select period, nb from (
select 
regexp_replace(split(split(tok1,",")[1],":")[1], "[\\]|}]", "") as nb,
split(split(split(tok1,",")[0],":")[1],"-") as periods
from
(select split(YOURSTRINGCOLUMN, "},") as s1 from YOURTABLE) 
r1 LATERAL VIEW explode(s1) ss1 AS tok1
) r2 LATERAL VIEW explode(periods) ss2 AS period;

5 Comments

SPLIT(periode, "-") makes no sense; periode is a column name inside an array of structs.
I have a hive table with the column "periode", the type of the column is string.....he has two columns..
No, this won't work. The OP specifically states their data is of type array<struct<periode:string, nb:int>>. You cannot call regex_replace() or split() on an array.
he doesn't have an array... his data is everyrhing is one strng, you are assuming that it is in an array because the data looks like an array. He is saying that we would like cast the string to a structure... wait for more clarification if you need :)
no problem, your solution is cleaner but much less efficient if we are talking about a lot of data, the solution without window functions doesn't reduce phase, which is an advantage if he is working with a lot of data
0

I realize this question is 1YO, but I ran into this same issue and tackled it by using the json_split brickhouse UDF.

SELECT EXPLODE(
    json_split(
        '[{"periode":"20160118-20160205","nb":1},{"periode":"20161130-20161130","nb":1},{"periode":"20161130-20161221","nb":1}]'
));

col
{"periode":"20160118-20160205","nb":1}
{"periode":"20161130-20161130","nb":1}
{"periode":"20161130-20161221","nb":1}

Sorry for the spaghetti code.

There's also a similar question here using JSON arrays instead of JSON strings. It's not the same case, but for anyone facing this kind of task it might be useful in a bigger context.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.