0
CREATE TABLE inventory_box (
 box_id varchar(10),
 value   integer
);

INSERT INTO inventory_box VALUES ('1', 10), ('2', 15), ('3', 20);

I prepared a sql fiddle with the schema.

I would like to select a list of inventory boxes with combined value of above 20

possible result 1. box 1 + box 2 (10 + 15 >= 20) 

Here is what I am doing right now:

SELECT * FROM inventory_box LIMIT 1 OFFSET 0;
-- count on the client side and see if I got enough
-- got 10
SELECT * FROM inventory_box LIMIT 1 OFFSET 1;
-- count on the client side and see if I got enough
-- got 15, add it to the first query which returned 10
-- total is 25, ok, got enough, return answer

I am looking for a solution where the scan will stop as soon as it reaches the target value

9
  • Ugh, that's a combinatorial problem. For anything except small numbers of boxes it'll be extremely, insanely expensive to compute, as the number of combinations to test is n! (n-factorial). Commented Mar 25, 2014 at 0:28
  • Sorry, I must have not expressed my question correctly. any combination as in I don't care which combination as long as it returns one; Please see the revised question. I suspect this could be done with some sort of recursive CTE. But for the life of me, can't figure out how to write such CTE Commented Mar 25, 2014 at 0:30
  • You should clarify what you want. If you only want a combination of items which exceeds 30 SELECT SUM(value) FROM inventory_box does the trick (by just selecting all rows...). Commented Mar 25, 2014 at 0:35
  • 1
    By the way you encountered the subset sum problem which is NP-complete in general. Do not expect a pretty solution here... Commented Mar 25, 2014 at 0:36
  • Ok, I think I need to clarify it further. I have thousands of boxes. I want to to any subset that fullfils that target value but not all of them Commented Mar 25, 2014 at 0:37

2 Answers 2

3

One possible approach scans the table in box_id order until the total is above 30, then returns all the previous rows plus the row that tipped the sum over the limit. Note that the scan doesn't stop when the sum is reached, it totals the whole table then goes back over the results to pick the results.

http://sqlfiddle.com/#!15/1c502/4

SELECT
  array_agg(box_id ORDER BY box_id) AS box_ids,
  max(boxsum) AS boxsum
FROM
(
  SELECT
    box_id,
    sum(value) OVER (ORDER BY box_id) AS boxsum,
    sum(value) OVER (ORDER BY box_id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevboxsum
  FROM 
    inventory_box
) x
WHERE prevboxsum < 30 OR prevboxsum IS NULL;

but really, this is going to be pretty gruesome to do in a general and reliable manner in SQL (or at all).

You can ORDER BY value ASC instead of ORDER BY box_id if you like; this will add boxes from the smallest to the biggest. However, this will catastrophically fail if you then remove all the small boxes from the pool and run it again, and repeat. Soon it'll just be lumping two big boxes together inefficiently.

To solve this for the general case, finding the smallest combination, is a hard optimization problem that probably benefits from imprecise sample- and probabilistic based methods.

To scan the table in order until the sum reaches the target, lock the table then use PL/PgSQL to read rows from a cursor that returns the rows in value order plus an array_agg(box_id) OVER (ORDER BY value) and sum(value) OVER (order by value). When you reach the desired sum, return the current row's array. This won't produce an optimal solution, but it'll produce a solution, and I think it'll do so without a full table scan if there's a suitable index in place.

Sign up to request clarification or add additional context in comments.

6 Comments

thanks for this answer. Although I don't need any kind of ordering at all. All I need is to return a subset that fullfills the target value. Although in my use case, a solution that is optimized for read performance would be preferable over a solution that optimizes for any ordering or smallest subset
I am looking at your solution on sql fiddle. If I remove order by and basically only have sum(value) OVER (box_id) AS boxsum, - do you think this would result in a faster read performance?
No. No matter what you do, it will do a full table scan. If you want to avoid that you're going to have to study the literature on this optimization problem and implement one of the algorithms that's known to work with it. As @GhostGambler said, this is a recognised "hard" problem, i.e. one that's computationally difficult to solve exactly.
Even if the target value is 10 and I only need to return box 1, it will still scan through millions of rows?
Would I have faster performance to just do select * from inventory_box LIMIT 1 OFFSET 0 and then OFFSET 1, and then OFFSET 2 keep iterating until I hit the target value? As Any combination would work.
|
2

Your question update clarifies that your actual requirements are much simpler than a full-blown "subset sum problem" as suspected by Ulrich:
Just fetch rows until the sum is big enough.

I sort by box_id to get deterministic results. You might even drop the ORDER BY altogether to get any valid result a bit faster, yet.

Slow: Recursive CTE

WITH RECURSIVE i AS (
   SELECT *, row_number() OVER (ORDER BY box_id) AS rn
   FROM   inventory_box
   )
, r AS (
   SELECT box_id, val, val AS total, 2 AS rn
   FROM   i
   WHERE  rn = 1

   UNION ALL
   SELECT i.box_id, i.val, r.total + i.val, r.rn + 1
   FROM   r
   JOIN   i USING (rn)
   WHERE  r.total < 20
   )
SELECT box_id, val, total
FROM   r
ORDER  BY box_id;

Fast: PL/pgSQL function with FOR loop

Using sum() as window aggregate function (cheapest).

CREATE OR REPLACE FUNCTION f_shop_for(_total int)
  RETURNS TABLE (box_id text, val int, total int)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   total := 0;

   FOR box_id, val, total IN
      SELECT i.box_id, i.val
           , sum(i.val) OVER (ORDER BY i.box_id) AS total
      FROM   inventory_box i
   LOOP
      RETURN NEXT;
      EXIT WHEN total >= _total;
   END LOOP; 
END
$func$;

Call:

SELECT * FROM f_shop_for(20);

I tested both with a big table of 1 million rows. The function only reads the necessary rows from index and table. The CTE is very slow, seems to scan the whole table ...

fiddle
OLD sqlfiddle

Aside: sorting by a varchar column (box_id) with numeric data yields dubious results. Should be a numeric type, really?

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.