2

I have a table p that looks like this:

ID Col1
AAA kddd
AAA 13bd
AAA 14cd
AAA 15cd
BBB 15cd
BBB 23fd
BBB 4rre
BBB tr3e
CCC kddd
CCC 12ed
DDD rrr4
DDD rtt4
DDD rrt4

I have three lists of LIKE patterns that classify each ID group based on which of its Col1 values match.

  1. If the codes are like ('_ddd', '_ccc', '_bbb', '_aaa') then return 'b'
  2. If the codes are like ('_3c_', '_3b_', '_3a_') then return 'S'
  3. If the codes are like ('_5c_', '_5b_', '_5a_') then return 'U'
  4. If none of the codes match then return 'U'

The real pattern lists are much longer, so I made temporary tables to store them:

CREATE OR REPLACE TEMPORARY TABLE b_codes (value VARCHAR(4));
INSERT INTO b_codes (value) VALUES ('_ddd'), ('_ccc'), ('_bbb'), ('_aaa');

I did the same for s_codes and u_codes.

For each ID: if none of its values match any of the codes, mark it 'U'. If it matches any u_codes but no s_codes or b_codes, mark it 'U'. If it matches any b_codes, mark it 'b'. If it matches both u_codes and s_codes, mark it 'S'.

The resulting table should look like

ID Col1
AAA S
BBB U
CCC b
DDD U

My attempt

SELECT ID, MAX(t.Flag) AS Flag
FROM (
   SELECT 
     ID,
     CASE
       WHEN p.Col1 LIKE ANY (SELECT value FROM u_codes) AND
         NOT (
              p.Col1 LIKE ANY (SELECT value FROM s_codes) OR
              p.Col1 LIKE ANY (SELECT value FROM b_codes)
         ) THEN 'U'
       WHEN p.Col1 LIKE ANY (SELECT value FROM s_codes) AND
         NOT (
              p.Col1 LIKE ANY (SELECT value FROM u_codes) OR
              p.Col1 LIKE ANY (SELECT value FROM b_codes)
         ) THEN 'S'
       WHEN p.Col1 LIKE ANY (SELECT value FROM b_codes) THEN 'b'
       WHEN (
         NOT p.Col1 LIKE ANY (SELECT value FROM u_codes) AND
         NOT p.Col1 LIKE ANY (SELECT value FROM s_codes) AND
         NOT p.Col1 LIKE ANY (SELECT value FROM b_codes)
         ) THEN NULL
       ELSE NULL
     END AS Flag
   FROM p
) AS t
GROUP BY ID;

The sub-query should return

ID Col1 Flag
AAA kddd b
AAA 13bd S
AAA 14cd NULL
AAA 15cd U
BBB 15cd U
BBB 23fd NULL
BBB 4rre NULL
BBB tr3e NULL
CCC kddd b
CCC 12ed NULL
DDD rrr4 NULL
DDD rtt4 NULL
DDD rrt4 NULL

I tried using Snowflake's lexicographical ordering in the MAX function, but I don't think that works. What would be a better way to get the correct labels in the MAX function?

1
  • It looks to me like the very first sample row (AAA,kddd) should match the b code, and so the result for AAA should be b rather than S. Commented Oct 31 at 16:31

4 Answers

2

What would be a better way to get the correct labels in the MAX function?

The problem with the MAX() function (and similar) for this kind of query is you often need to use the max value from one column to show the corresponding value from another column, or you need to define some other criteria for what you mean by "MAX". This is possible with normal aggregation, but tends to be complex to write and maintain and slow to execute.
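To make the MAX problem concrete, here is a small Python sketch. Python is only a stand-in: it compares strings by code point, which I am assuming agrees with Snowflake's default collation for these ASCII letters. The `PRECEDENCE` dict and the `pick_label` helper are hypothetical names, and the b > S > U order is my reading of the rules in the question.

```python
# Hypothetical illustration: lexicographic MAX vs. an explicit precedence.
labels_for_id = ["S", None, "U"]  # per-row flags for one ID; None stands in for NULL

# MAX-like behaviour: NULLs are skipped, strings compare by code point.
lexicographic_max = max(l for l in labels_for_id if l is not None)
print(lexicographic_max)  # U, because ord("U") == 85 > ord("S") == 83

# Explicit precedence (assumed from the question's rules): b beats S beats U.
PRECEDENCE = {"b": 1, "S": 2, "U": 3}

def pick_label(labels):
    """Return the matched label with the lowest precedence number, or 'U' if none matched."""
    matched = [l for l in labels if l is not None]
    return min(matched, key=PRECEDENCE.__getitem__, default="U")

print(pick_label(labels_for_id))  # S
```

So a plain MAX over the labels only gives the right answer if the letters happen to sort in precedence order, which they do not here.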

Instead, this is where analytic Window Functions shine. You use the window function to define rankings that apply to entire rows within a partition, and then filter for only the row(s) with the desired rank. Then you can take values from whichever columns in those rows you need.


With this technique, Snowflake can find the desired results from this original problem with no nesting/subqueries* and only ONE mapping table, which can also be expressed concisely in the query and does not need to be a temp table:

SELECT p.ID, coalesce(m.code,'U') code 
FROM p
LEFT JOIN (
    VALUES
        (1, 'b', '_ddd'),
        (1, 'b', '_ccc'),
        (1, 'b', '_bbb'),
        (1, 'b', '_aaa'),
        (2, 'S', '_3c_'),
        (2, 'S', '_3b_'),
        (2, 'S', '_3a_'),
        -- U codes not needed, since it's the default; included only for completeness
        (3, 'U', '_5c_'),
        (3, 'U', '_5b_'),
        (3, 'U', '_5a_')
    ) m(precedence, code, expr) ON p.Col1 LIKE m.expr
QUALIFY row_number() over (partition by p.ID order by m.precedence) = 1

More than 9 times in 10, if you have a temp table you should have something like a subquery, table-value constructor, or common table expression instead.


"The patterns are much longer ... "

This might justify a temporary table vs the table-value constructor, but if you can build the INSERT SQL you can build this just as easily.

Even if you do continue to use a temp table I would still use this structure with the single mapping table. At most I might normalize it to two tables so each code/precedence pair has one row in a parent table, and then join to a child table for just code+expr columns. But that complexity is probably not worth it here.

If the data is really that detailed, I'd also look to make this permanent, so the data can also be maintained outside of this query and perhaps even indexed (but Snowflake will probably do just fine w/o the index/cluster).

Note that when we do pull this data out to its own location, the query reduces to just four lines:

SELECT p.ID, coalesce(m.code,'U') code 
FROM p
LEFT JOIN code_map m ON p.Col1 LIKE m.expr
QUALIFY row_number() over (partition by p.ID order by m.precedence) = 1
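
For anyone who wants to try the logic without a Snowflake account, here is a rough sketch that runs the same idea against SQLite through Python's `sqlite3` module. This is an assumption-laden stand-in, not the Snowflake query itself: SQLite has no QUALIFY, so the row number is filtered in an outer query, and SQLite sorts NULLs first in ascending order, so the ORDER BY pushes them last explicitly. The `code_map` contents are copied from the question's pattern lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Snowflake's LIKE is case-sensitive; ask SQLite to behave the same way.
conn.execute("PRAGMA case_sensitive_like = ON")
conn.executescript("""
CREATE TABLE p (ID TEXT, Col1 TEXT);
INSERT INTO p VALUES
  ('AAA','kddd'),('AAA','13bd'),('AAA','14cd'),('AAA','15cd'),
  ('BBB','15cd'),('BBB','23fd'),('BBB','4rre'),('BBB','tr3e'),
  ('CCC','kddd'),('CCC','12ed'),
  ('DDD','rrr4'),('DDD','rtt4'),('DDD','rrt4');
CREATE TABLE code_map (precedence INTEGER, code TEXT, expr TEXT);
INSERT INTO code_map VALUES
  (1,'b','_ddd'),(1,'b','_ccc'),(1,'b','_bbb'),(1,'b','_aaa'),
  (2,'S','_3c_'),(2,'S','_3b_'),(2,'S','_3a_'),
  (3,'U','_5c_'),(3,'U','_5b_'),(3,'U','_5a_');
""")

# Rank every (row, matching pattern) pair per ID, then keep the best-ranked one.
query = """
SELECT ID, code FROM (
  SELECT p.ID,
         COALESCE(m.code, 'U') AS code,
         ROW_NUMBER() OVER (
           PARTITION BY p.ID
           ORDER BY m.precedence IS NULL, m.precedence  -- unmatched rows last
         ) AS rn
  FROM p
  LEFT JOIN code_map m ON p.Col1 LIKE m.expr
)
WHERE rn = 1
ORDER BY ID
"""
result = dict(conn.execute(query).fetchall())
print(result)  # {'AAA': 'b', 'BBB': 'U', 'CCC': 'b', 'DDD': 'U'}
```

AAA comes out as b here because (AAA,kddd) matches '_ddd', which has the best precedence.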

"The resulting table should look like ... "

I think you have an error here. The very first row in the sample data, (AAA,kddd), should match the b code. However, the expected results for the AAA ID show an S, which has lower precedence. (This is another reason not to normalize the mapping table; repeating the precedence in each row allows for more complex rules where sometimes a "lesser" code might still win).



* Not counting the table-value constructor, which has neither SELECT nor FROM


2
SELECT 
    p.id,
    CASE
        WHEN EXISTS (SELECT 1 FROM b_codes WHERE p.Col1 LIKE value) THEN 'b'
        WHEN EXISTS (SELECT 1 FROM s_codes WHERE p.Col1 LIKE value) THEN 'S'
        WHEN EXISTS (SELECT 1 FROM u_codes WHERE p.Col1 LIKE value) THEN 'U'
        ELSE 'U'
    END AS Flag
FROM p;

output:

ID Flag
AAA b
AAA S
AAA U
AAA U
BBB U
BBB U
BBB U
BBB U
CCC b
CCC U
DDD U
DDD U
DDD U
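
The output above still has one row per input row. If one label per ID is wanted, a possible follow-up (my assumption about the intent, not part of the query above) is to emit a precedence number per row and take the MIN per ID. A sketch against SQLite via Python's `sqlite3`, used here only as a runnable stand-in for Snowflake; the alias `prec` is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE p (ID TEXT, Col1 TEXT);
INSERT INTO p VALUES
  ('AAA','kddd'),('AAA','13bd'),('AAA','14cd'),('AAA','15cd'),
  ('BBB','15cd'),('BBB','23fd'),('BBB','4rre'),('BBB','tr3e'),
  ('CCC','kddd'),('CCC','12ed'),
  ('DDD','rrr4'),('DDD','rtt4'),('DDD','rrt4');
CREATE TABLE b_codes (value TEXT);
INSERT INTO b_codes VALUES ('_ddd'),('_ccc'),('_bbb'),('_aaa');
CREATE TABLE s_codes (value TEXT);
INSERT INTO s_codes VALUES ('_3c_'),('_3b_'),('_3a_');
""")

# Per row: 1 for a b match, 2 for an S match, 3 otherwise (U is the fallback).
# Per ID: the smallest number wins and is mapped back to its letter.
query = """
SELECT ID,
       CASE MIN(prec) WHEN 1 THEN 'b' WHEN 2 THEN 'S' ELSE 'U' END AS Flag
FROM (
  SELECT p.ID,
         CASE
           WHEN EXISTS (SELECT 1 FROM b_codes WHERE p.Col1 LIKE value) THEN 1
           WHEN EXISTS (SELECT 1 FROM s_codes WHERE p.Col1 LIKE value) THEN 2
           ELSE 3
         END AS prec
  FROM p
) t
GROUP BY ID
ORDER BY ID
"""
flags = dict(conn.execute(query).fetchall())
print(flags)  # {'AAA': 'b', 'BBB': 'U', 'CCC': 'b', 'DDD': 'U'}
```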


0

Matching 3 of the same letter in Snowflake is surprisingly not straightforward. One option would be to use REGEXP_LIKE with an alternation:

SELECT
    ID,
    -- REGEXP_LIKE implicitly anchors the pattern at both ends, so the leading '.' covers the first character
    CASE WHEN REGEXP_LIKE(Col1, '.(AAA|BBB|CCC|DDD|EEE|FFF|GGG|HHH|III|JJJ|KKK|LLL|MMM|NNN|OOO|PPP|QQQ|RRR|SSS|TTT|UUU|VVV|WWW|XXX|YYY|ZZZ)', 'i')
         THEN 'b'
         WHEN REGEXP_LIKE(Col1, '.3[A-Z].', 'i') THEN 'S'
         ELSE 'U' END AS Flag
FROM yourTable;

3 Comments

If I'm reading this right, I don't think the first when statement matches the patterns for b_codes, it's searching Col1 for ID values. I don't understand why to do that.
Apologies, I realized I missed the END of the CASE WHEN statement. Also added my expected subquery result.
What information do you want to tell us? If the answer solves your task, then accept it. If you want to edit your question, just do it. No need to comment this. Please remove your comments here unless you have a question concerning the answer. In this case, ask.

0

You don't need to create so many tables. Instead, you could create a single temporary table that stores each pattern together with the label it maps to:

CREATE OR REPLACE TEMPORARY TABLE codes (value VARCHAR(4), code VARCHAR(1));
INSERT INTO codes (value, code) VALUES
('_ddd', 'b'), ('_ccc', 'b'), ('_bbb', 'b'), ('_aaa', 'b'),
('_3c_', 'S'), ('_3b_', 'S'), ('_3a_', 'S');

And then you can left join:

select
    ID,
    case
        when max(rank) = 2 then 'b'
        when max(rank) = 1 then 'S'
        else 'U'
    end as result
from (
    select 
        p.ID,
        -- compare the label column (code), not the pattern column (value)
        case
            when codes.code = 'b' then 2
            when codes.code = 'S' then 1
        end as rank
    from p
    left join codes
    on p.col1 like codes.value) t
group by ID

In the subquery you get all code matches for col1, ranked as 2 if they were a b and as 1 if they were an S, defaulting to null for other values. Then the outer query takes these records, groups them by ID and looks for the greatest rank. If it is 2, the result is b. If it is 1, the result is S. Otherwise it is U. I prioritised b over S, but given the values this does not seem to be a problem, as all b values are mutually exclusive in the rules you have given.

I also did not apply point 3, as its result would be identical to the absolute fallback, so it is not worth checking for 3 and we can treat 3 and 4 together as the absolute fallback.

EDIT -> further simplification as suggested in the comment-section:

CREATE OR REPLACE TEMPORARY TABLE codes (value VARCHAR(4), code VARCHAR(1));
INSERT INTO codes (value, code) VALUES
('_ddd', 'b'), ('_ccc', 'b'), ('_bbb', 'b'), ('_aaa', 'b'),
('_3c_', 'S'), ('_3b_', 'S'), ('_3a_', 'S');
CREATE OR REPLACE TEMPORARY TABLE ranks(code VARCHAR(1), rank int);
INSERT INTO ranks (code, rank) VALUES
('b', 2),
('S', 1);

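We join with the ranks table too. Note that this version returns the numeric rank (2 for b, 1 for S, NULL for U) rather than the letter: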

select
    ID,
    max(rank) as result
from (
    select 
        p.ID,
        ranks.rank
    from p
    left join codes
    on p.col1 like codes.value
    left join ranks
    on codes.code = ranks.code
) t
group by ID
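
If the letter is wanted rather than the number, one option is to join the winning rank back to the ranks table and fall back to U when nothing matched. A minimal sketch, run here against SQLite through Python's `sqlite3` module instead of Snowflake (an assumption made so it can be run anywhere); the alias `top_rank` is mine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE p (ID TEXT, Col1 TEXT);
INSERT INTO p VALUES
  ('AAA','kddd'),('AAA','13bd'),('AAA','14cd'),('AAA','15cd'),
  ('BBB','15cd'),('BBB','23fd'),('BBB','4rre'),('BBB','tr3e'),
  ('CCC','kddd'),('CCC','12ed'),
  ('DDD','rrr4'),('DDD','rtt4'),('DDD','rrt4');
CREATE TABLE codes (value TEXT, code TEXT);
INSERT INTO codes VALUES
  ('_ddd','b'),('_ccc','b'),('_bbb','b'),('_aaa','b'),
  ('_3c_','S'),('_3b_','S'),('_3a_','S');
CREATE TABLE ranks (code TEXT, rank INTEGER);
INSERT INTO ranks VALUES ('b', 2), ('S', 1);
""")

# Inner query: best (highest) rank per ID, NULL when no pattern matched.
# Outer join: translate that rank back to its letter, defaulting to 'U'.
query = """
SELECT t.ID, COALESCE(r.code, 'U') AS result
FROM (
  SELECT p.ID, MAX(ranks.rank) AS top_rank
  FROM p
  LEFT JOIN codes ON p.Col1 LIKE codes.value
  LEFT JOIN ranks ON codes.code = ranks.code
  GROUP BY p.ID
) t
LEFT JOIN ranks r ON r.rank = t.top_rank
ORDER BY t.ID
"""
labels = dict(conn.execute(query).fetchall())
print(labels)  # {'AAA': 'b', 'BBB': 'U', 'CCC': 'b', 'DDD': 'U'}
```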

2 Comments

Put the rank value inside the temp table, too, to simplify this even further.
Great idea, thank you! I created another temporary table in order to avoid redundancy, but applied your changes in an edit.
