I have a column that holds a string of comma-separated values (in no particular order):

event_list

2,100,101,102,103,104,105,106,110,114,121,126,152,185,191,524,150,198,158,111,20
100,101,102,103,104,110,114,121,126,152,175,185,191,150,198,158,111,123,10091

Of these values, I am only interested in 1, 2, 10, 11, 12, 13, 14 and 20; the rest are irrelevant. For example, 2 is "Product view" and 12 is "Add to cart".

So I am trying to do something like:

CASE WHEN 2 IN event_list THEN 1 ELSE 0 END as product_view_flag,
CASE WHEN 12 IN event_list THEN 1 ELSE 0 END as add_to_cart_flag
...

But since this is SQL and not Python, I don't think the above is possible, so I am trying to figure out how to do it. I also don't think a plain substring or regex match will help, since '120' contains '2' as well.

STRING_SPLIT is not an optimal solution because the table already has 900 billion rows.

  • Storing delimited lists like this is the root of your problem. It violates 1NF by storing multiple values in a single tuple. It is going to be a challenge for performance, but you can use STRING_SPLIT. learn.microsoft.com/en-us/sql/t-sql/functions/… Commented May 31, 2022 at 16:29
  • @SeanLange yeah, that is the problem of underlying data. Unfortunately string_split is not going to work, because the data is 900BIL rows already. If I do string split, it will blow out to trillions of rows, which does not make sense. Commented May 31, 2022 at 16:31
  • Don't know what to tell you. The design is forcing you to parse the rows, parsing the rows is a mountain of data. Commented May 31, 2022 at 16:37
  • Does this HAVE to be on database-side? can't you just get the whole column and parse/process the values in the client-side? Commented May 31, 2022 at 17:02
  • Does the partition or resultset of the query you are working with have 900 billion rows or is that how big the table is? Commented May 31, 2022 at 17:07

2 Answers

1

A couple of methods would be:

SELECT *
FROM YourTable yt
CROSS APPLY
(
    SELECT MAX(CASE WHEN value = '2'  THEN 1 ELSE 0 END) AS product_view_flag,
           MAX(CASE WHEN value = '12' THEN 1 ELSE 0 END) AS add_to_cart_flag
    FROM STRING_SPLIT(yt.event_list, ',')
) ca

or

SELECT yt.*,
       CASE WHEN adj_event_list LIKE '%,2,%'  THEN 1 ELSE 0 END AS product_view_flag,
       CASE WHEN adj_event_list LIKE '%,12,%' THEN 1 ELSE 0 END AS add_to_cart_flag
FROM YourTable yt
CROSS APPLY (SELECT CONCAT(',', yt.event_list, ',')) CA(adj_event_list)

If you are actually running this on 900 billion rows, both will be slow. I can't guess which will "win"; you would need to test both.
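
One way to run that comparison is to execute both queries against a representative sample with SET STATISTICS TIME ON and read the CPU time from the Messages output. This is only a minimal sketch, assuming SQL Server 2016 or later (needed for STRING_SPLIT); the temp table #YourTable and its sample rows are hypothetical stand-ins for the real table:

```sql
-- Hypothetical sample table standing in for the real one.
CREATE TABLE #YourTable (event_list varchar(8000) NOT NULL);

INSERT INTO #YourTable (event_list)
VALUES ('2,100,101,102,12'),
       ('100,101,120,212');

-- Report CPU and elapsed time for each statement in the Messages output.
SET STATISTICS TIME ON;

-- Method 1: split each list and aggregate the matches per row.
SELECT yt.event_list, ca.product_view_flag, ca.add_to_cart_flag
FROM #YourTable yt
CROSS APPLY
(
    SELECT MAX(CASE WHEN value = '2'  THEN 1 ELSE 0 END) AS product_view_flag,
           MAX(CASE WHEN value = '12' THEN 1 ELSE 0 END) AS add_to_cart_flag
    FROM STRING_SPLIT(yt.event_list, ',')
) ca;

-- Method 2: wrap the list in delimiters and match with LIKE.
SELECT yt.event_list,
       CASE WHEN ca.adj_event_list LIKE '%,2,%'  THEN 1 ELSE 0 END AS product_view_flag,
       CASE WHEN ca.adj_event_list LIKE '%,12,%' THEN 1 ELSE 0 END AS add_to_cart_flag
FROM #YourTable yt
CROSS APPLY (SELECT CONCAT(',', yt.event_list, ',')) ca(adj_event_list);

SET STATISTICS TIME OFF;

DROP TABLE #YourTable;
```

Two rows will not show a meaningful difference; the sample would need to be large enough, and shaped enough like the real data, for the CPU times to be worth comparing.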

5 Comments

It appears that the second query will be more performant: its plan has one table scan at 100% cost and one compute scalar feeding the SELECT. The first query is slower because it has to nested-loop join the table-valued function results to the table scan.
The only way to know is to test them both. Both will be CPU bound operations and estimated costs for CPU time in execution plans are extremely unreliable.
I ran those queries with 100, 10,000, 100,000, 1,000,000 and 10,000,000 rows. The actual plan did not change, with very close to 70% of the overall query cost for query 1 and 30% for query 2.
You can't go off the costings in execution plans at all for this (70% vs 30%). The only difference between "actual" and "estimated" plans is that the "actual" plan is the estimated plan with certain runtime stats added. This does not include any adjustment of costings. You would need to run them and get the actual CPU time stats (as shown in STATISTICS TIME ON results or query stats DMVs). I wouldn't be astonished to find out that the first query is slower, but that is the way to determine it.
Martin Smith - Understood, there are other factors involved. If that table is partitioned, then the number of partitions accessed and the underlying hardware that each accessed partition resides on may change things when accessing billions of records. Mileage may vary.
0

Actually, it turned out to be much simpler, and it performs well.

CASE WHEN event_list = '2' OR event_list LIKE '2,%' OR event_list LIKE '%,2,%' OR event_list LIKE '%,2'
THEN 1 ELSE 0 END AS product_view_flag,
CASE WHEN event_list = '12' OR event_list LIKE '12,%' OR event_list LIKE '%,12,%' OR event_list LIKE '%,12'
THEN 1 ELSE 0 END AS add_to_cart_flag,
...
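
As a quick sanity check of the matching rule, the sketch below runs the start, middle and end patterns over a few hypothetical lists, including one where 2 appears only inside other numbers. The equality test covers a list that holds just the single value, which the three LIKE patterns alone do not match:

```sql
-- Hypothetical sample lists; the expected flag is noted on each row.
SELECT t.event_list,
       CASE WHEN t.event_list = '2'
              OR t.event_list LIKE '2,%'
              OR t.event_list LIKE '%,2,%'
              OR t.event_list LIKE '%,2'
            THEN 1 ELSE 0 END AS product_view_flag
FROM (VALUES ('2,100,101'),    -- 2 at the start: 1
             ('100,2,101'),    -- 2 in the middle: 1
             ('100,101,2'),    -- 2 at the end: 1
             ('2'),            -- 2 alone: 1
             ('120,212,20')    -- 2 only inside other numbers: 0
     ) AS t(event_list);
```

This assumes the lists contain no spaces around the commas, as in the sample data in the question.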
