
I tried to search for a solution but didn't find anything for my case...

Here is the database declaration (simplified):

CREATE TABLE documents (
    document_id int4 NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    data_block jsonb NULL
);

And here are two example inserts:

INSERT INTO documents (document_id, data_block)
VALUES(878979,
    '{"COMMONS": {"DATE": {"value": "2017-03-11"}},
      "PAYABLE_INVOICE_LINES": [
          {"AMOUNT": {"value": 52408.53}},
          {"AMOUNT": {"value": 654.23}}
      ]}');
INSERT INTO documents (document_id, data_block)
VALUES(977656,
    '{"COMMONS": {"DATE": {"value": "2018-03-11"}},
      "PAYABLE_INVOICE_LINES": [
          {"AMOUNT": {"value": 555.10}}
      ]}');

I want to search for all documents where one of the PAYABLE_INVOICE_LINES has a line with a value greater than 1000.00.

My query is:

select *
from documents d
cross join lateral jsonb_array_elements(d.data_block -> 'PAYABLE_INVOICE_LINES') as pil 
where (pil->'AMOUNT'->>'value')::decimal >= 1000

But, as I want to limit the result to 50 documents, I have to group by document_id and apply the limit.
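Something like this (a sketch of the grouped version; essentially the same query with a group by and limit added):

select d.document_id
from documents d
cross join lateral jsonb_array_elements(d.data_block -> 'PAYABLE_INVOICE_LINES') as pil
where (pil->'AMOUNT'->>'value')::decimal >= 1000
group by d.document_id
limit 50;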

With millions of documents, this query is very expensive: around 10 seconds for 1 million documents.

Do you have any ideas for better performance?

Thanks

1 Comment
  • I'm stuck on PG 9.3 at the moment so don't have that data type yet, but I briefly worked on a PG 9.6 project where we stored data blobs in jsonb fields, and you could create an index on values in that field which had pretty ok performance. Maybe that's what you should look into, if you have to keep the structure as it is. Commented Mar 31, 2018 at 10:55
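For reference, the kind of expression index the comment describes could look like this (illustrative; it indexes a single scalar key, not the array elements the question filters on):

create index idx_documents_commons_date_value
  on documents ((data_block -> 'COMMONS' -> 'DATE' ->> 'value'));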

2 Answers


Instead of cross join lateral, use where exists:

select *
from documents d
where exists (
  select 1
  from jsonb_array_elements(d.data_block -> 'PAYABLE_INVOICE_LINES') as pil
  where (pil->'AMOUNT'->>'value')::decimal >= 1000)
limit 50;

Update

And yet another method, more complex but also much more efficient.

Create a function that returns the max value from your JSONB data; it must be declared immutable so that it can be used in an index expression:

create function fn_get_max_PAYABLE_INVOICE_LINES_value(JSONB) returns decimal language sql immutable as $$
  select max((pil->'AMOUNT'->>'value')::decimal)
  from jsonb_array_elements($1 -> 'PAYABLE_INVOICE_LINES') as pil $$;

Create an index on this function:

create index idx_max_PAYABLE_INVOICE_LINES_value
  on documents(fn_get_max_PAYABLE_INVOICE_LINES_value(data_block));

Use the function in your query:

select *
from documents d
where fn_get_max_PAYABLE_INVOICE_LINES_value(data_block) > 1000
limit 50;

In this case the index will be used and the query will be much faster on a large amount of data.
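To check that the index is actually picked up, you can look at the plan (a sketch; the exact plan depends on your data and statistics):

explain (analyze, buffers)
select *
from documents d
where fn_get_max_PAYABLE_INVOICE_LINES_value(data_block) > 1000
limit 50;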

PS: Usually limit only makes sense in combination with order by.


4 Comments

No reason to believe rewriting the join to exists would do anything. But creating an index on the function should make all the difference.
@Andomar You are probably right, the execution time is almost the same in my quick test: dbfiddle.uk/… However, IMO using exists makes the query clearer.
Hum, I just tried it on my database, and the exists solution seems to be very efficient. Using a specific function could be a better solution, but I would have to do this work for many other operators like >=, <, <=, !=, ... because it is in fact a search engine... Thank you very much!!!
@Ryu "but I have to do this work for many others operators like" There are three functions and indexes on them cold make your engine faster: fn_get_max() returns numeric..., fn_get_min() returns numeric... and fn_get_array() returns numeric[]...; index on documents using gin(fn_get_array(data_block)); Thus, >= operator could be realized like where fn_get_max(data_block) > 1000 or array[1000] <@ fn_get_array(data_block)

Grouping and limiting is easy enough:

select  document_id
from    documents d
cross join lateral 
        jsonb_array_elements(d.data_block -> 'PAYABLE_INVOICE_LINES') as pil 
where   (pil->'AMOUNT'->>'value')::decimal >= 1000
group by
        document_id
limit   50

If you query this more often, you could store a list of documents and invoice lines in a separate table. When you're adding, modifying or deleting documents, you'd have to keep the separate table up to date too. But querying a regular table is much faster than querying JSON columns.
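A sketch of what such a side table could look like (names are illustrative, and it has to be kept in sync by the application or by triggers):

create table document_invoice_lines (
    document_id int4 not null,
    amount numeric not null
);

create index idx_document_invoice_lines_amount
    on document_invoice_lines (amount);

-- fill it once from the existing JSONB, then maintain it on every insert/update/delete
insert into document_invoice_lines (document_id, amount)
select d.document_id, (pil->'AMOUNT'->>'value')::numeric
from documents d
cross join lateral jsonb_array_elements(d.data_block -> 'PAYABLE_INVOICE_LINES') as pil;

-- the search then becomes a plain indexed query
select distinct document_id
from document_invoice_lines
where amount >= 1000
limit 50;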

2 Comments

Of course, I can group like this. But grouping over millions of lines is very slow, because PAYABLE_INVOICE_LINES may have hundreds of lines.
And changing the structure is not an option because we don't know it. It can change from one document type to another. I am really looking for an optimization of the query in this context.
