2

I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:

data: {
  "lastUpdatedTime": "2016-12-26T12:09:43.901Z",
  "UID": "2c5bb7fd-1a00-4988-8d92-ffaa52ebc20d",
  "data": {
    "country": "UK",
    "verified_at": "2017-01-01T23:49:10.217Z"
  }
}

The data format cannot be changed since this is legacy information.

I need to obtain all accounts where the country is UK, the verified_at value is not null and the lastUpdatedTime value is greater than some given value.

So far, I have the following query:

SELECT * FROM "accounts"
WHERE (data @> '{ "data": { "country": "UK" } }')
AND (data->'data' ? 'verified_at')
AND ((data->'data' ->> 'verified_at') is not null)
AND (data ->>'lastUpdatedTime' > '2016-02-28T05:49:08.511846')
ORDER BY data ->>'lastUpdatedTime' LIMIT 100 OFFSET 0;

And the following indexes:

"accounts_idxgin" gin (data)
"accounts_idxgin_on_data" gin ((data -> 'data'::text))

I've managed to get the query time down to about 1000 to 4000ms

Here is the analyze from the query:

 Bitmap Heap Scan on accounts  (cost=41.31..6934.50 rows=9 width=1719)
                               (actual time=7.273..1067.657 rows=23190 loops=1)
   Recheck Cond: ((data -> 'data'::text) ? 'verified_at'::text)
   Filter: ((((data -> 'data'::text) ->> 'verified_at'::text) IS NOT NULL)
           AND ((data ->> 'lastUpdatedTime'::text) > '2016-02-01 05:49:08.511846'::text)
           AND (((data -> 'data'::text) ->> 'country'::text) = 'UK'::text))
   Rows Removed by Filter: 4
   Heap Blocks: exact=16039
   ->  Bitmap Index Scan on accounts_idxgin_on_data  (cost=0.00..41.30 rows=1773 width=0)
       (actual time=4.618..4.618 rows=23194 loops=1)
         Index Cond: ((data -> 'data'::text) ? 'verified_at'::text)
 Planning time: 0.448 ms
 Execution time: 1069.344 ms
(9 rows)

I have the following questions

  1. Is there anything I can do to further speed up this query?
  2. What is the correct way to speed up a field is not null query with JSONB? I ended up using the existence operator with (data->'data' ? 'verified_at') to filter out a large number of non-matching records, because much of my data doesn't have verified_at as a top level key. This increased the speed of the query, but I'm wondering if there's a general approach to optimizing this type of query.
  3. In order to use the existence operator with (data->'data' ? 'verified_at'), I needed to add another index on ((data -> 'data'::text)). I already had an index on gin (data), but the existence operator didn't use this. Why is that? I thought the existence and containment operators would use this index.

3 Answers 3

5

3: Not really. This case is explicitly mentioned in the docs. When you have an index on the column data, it is only used, when you query your table, like data @> '...' or data ? '...'. When you have an index on the expression (data -> 'data'), these queries can take advantage of it: (data -> 'data') @> '...' or (data -> 'data') ? '...'.

2: usual jsonb indexes won't help during a (jsonb_col -> '<key>') is [not] null query at all. And unfortunately, you cannot use jsonb_col @> '{"<key>":null}' either, because the JSON object might lack the key entirely. Also reverse use of the index (for is not null) is not possible at all. But there may be a trick...

1: Not much. There may be some improvements, but don't expect huge performance advantages. So here them go:

You can use the jsonb_path_ops operator class instead of the (default) jsonb_ops. This should mean a little improvement in performance, but they cannot use the existence operator (?). But we won't need it anyway.

You have a single, index-unfriendly, boolean typed expression, which slows you down. Thankfully you can use a partial index here if you only interested in true values.

So, your index should look something like this:

create index accounts_idxgin_on_data
  on accounts using gin ((data -> 'data') jsonb_path_ops)
  where (data -> 'data' ->> 'verified_at') is not null;

With this index, you can use the following query:

select   *
from     accounts
where    (data -> 'data') @> '{"country":"UK"}'
and      (data -> 'data' ->> 'verified_at') is not null
and      (data ->> 'lastUpdatedTime') > '2016-02-28T05:49:08.511Z'
order by data ->>'lastUpdatedTime';

Note: for proper timestamp comparisons, you should use (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511Z'.

http://rextester.com/QWUW41874

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for the response pozs. I've tried your index and query and it gives me an execution time of around 1000ms, and if I perform a count(*), I get about the same result. From the partial index that I came up with in my answer, I was able to get the query time to about 600ms and a count(*) to 450ms, so I'm going to stick with my index and query. In regards to your answer for question 3) the docs didn't make sense to me at the time, but now I understand it, thanks for the clarification. Your answer to 2) verifies my suspicions. Thanks again for your help!
3

After playing around a bit more, I've managed to reduce my query time from around 1000ms to 350ms by creating the following partial index:

 CREATE INDEX index_accounts_partial_on_verified_at
 ON accounts ((data->'data'->'verified_at'))
 WHERE (data->'data'->>'verified_at') IS NOT NULL
 AND (data->'data' ? 'verified_at')
 AND (data->'data'->>'country' = 'UK');

I was able to hardcode some of the values in this index, such as country=UK because I only need to consider UK accounts for this query. I was also able to remove the index on ((data->'data')) which was 258MB, and replace it with the partial index which is only 1360 kB!

For anyone interested, I found the details for building a partial JSONB index from here

Comments

1

Use the path access operator for faster access to lower-level objects:

SELECT * FROM "accounts"
WHERE data #>> '{data, country}' = 'UK'
  AND data #>> '{data, verified_at}' IS NOT NULL
  AND data ->> 'lastUpdatedTime' > '2016-02-28T05:49:08.511846'
ORDER BY data ->> 'lastUpdatedTime' LIMIT 100 OFFSET 0;

The index only works on the top-level key. So, with an index on column data queries like data @> [[key]] are supported. However, for a query on data -> 'data' ? 'verified_at' you need an index on data->'data'.

Two more points:

  • I don't think it is necessary to test for the presence of verified_at. If it is not there it simply comes out as NULL so it gets caught by the same test.
  • Comparing string representations of timestamp values may work if the JSON value is properly and consistently formatted. Cast to timestamp to be on the safe side.

1 Comment

Thanks for the response Patrick. I've tried your query suggestion, but the path access operator didn't make any difference, and removing the existence operator condition (data->'data' ? 'verified_at') resulted in a much slower query because the ((data -> 'data')) index wasn't used, so the execution time increased to 15912.736 ms.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.