I have a table with 36.64 million entries. The table definition is as follows:

id integer, PK
attribute, varchar 255
value, varchar 255
store_id, integer
timestamp, timestamp without timezone
mac_address, varchar 255

The mac_address and timestamp columns are also indexed.

The queries:

select count(*) from table where mac_address = $1 and timestamp between $2 and $3
select * from table where mac_address = $1 and timestamp between $2 and $3

If I run this in pgAdmin, it takes a total of 10 seconds. If I run it through JPA, it takes more than 40 seconds. There is no EAGER loading.

I've looked into the SimpleJpaRepository code. It runs exactly these two queries: a count() and a getResultList().

Questions:
1. It looks like the timestamp index is not used in either pgAdmin or JPA. I've checked this with ANALYZE and EXPLAIN. But why?
2. Why does JPA need 10x more time? ORM adds overhead, but 10 times?
3. How do I improve it?

EDIT 1:

Maybe the count() from JPA is not using an index scan but a sequential scan, which is slow. My PostgreSQL version is 9.5.

EDIT 2: JPA uses setFirstResult() and setMaxResults() to fetch a page of 100 entries out of a total of 259242.

I tried to mimic it with LIMIT and OFFSET, but I didn't see these keywords in the JPA query. Maybe JPA fetches the whole result and then does the paging in memory, which in turn causes the performance issue?

The first execution of the count() query takes 19 to 55 seconds in pgAdmin.

The EXPLAIN output of the two queries:

count()

Aggregate  (cost=761166.10..761166.11 rows=1 width=4) (actual time=1273.871..1273.871 rows=1 loops=1)
  Output: count(id)
  Buffers: shared read=92986 written=56
  ->  Bitmap Heap Scan on public.device_messages playerstat0_  (cost=11165.36..760309.47 rows=342650 width=4) (actual time=76.217..1258.389 rows=259242 loops=1)
        Output: id, attributecode, attributevalue, store_id, "timestamp", mac_address
        Recheck Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
        Rows Removed by Index Recheck: 6281401
        Heap Blocks: exact=36622 lossy=55083
        Buffers: shared read=92986 written=56
        ->  Bitmap Index Scan on device_messages_mac_address_timestamp_idx  (cost=0.00..11079.70 rows=342650 width=0) (actual time=69.636..69.636 rows=259242 loops=1)
              Index Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
              Buffers: shared read=1281
Planning time: 0.138 ms
Execution time: 1274.275 ms
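
A rough reading of the count() plan above, offered as a sketch rather than a diagnosis: "Heap Blocks: exact=36622 lossy=55083" means the bitmap tracked most heap blocks only page by page, so every row on those pages had to be rechecked against the condition, which is where "Rows Removed by Index Recheck" comes from. In PostgreSQL a bitmap turns lossy when it does not fit in work_mem; whether raising work_mem would help on this server is an assumption that would need testing. The class and method names below are hypothetical, and the numbers are copied from the plan.

```java
/** Hypothetical helper that puts figures from the count() plan into proportion. */
final class RecheckOverhead {
    private RecheckOverhead() {}

    /** Share of visited heap blocks that the bitmap tracked only page-wise (lossy). */
    static double lossyShare(long exactBlocks, long lossyBlocks) {
        return (double) lossyBlocks / (exactBlocks + lossyBlocks);
    }

    /** Rows examined in the heap for every row that survived the recheck. */
    static double rowsExaminedPerMatch(long matchedRows, long removedByRecheck) {
        return (double) (matchedRows + removedByRecheck) / matchedRows;
    }
}
```

Calling RecheckOverhead.lossyShare(36622, 55083) gives roughly 0.60, and RecheckOverhead.rowsExaminedPerMatch(259242, 6281401) gives roughly 25, so about 25 rows were examined for each row that was actually counted.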

select

Limit  (cost=3362.52..5043.49 rows=100 width=34) (actual time=30.291..42.846 rows=100 loops=1)
  Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
  Buffers: shared hit=15447 read=1676
  ->  Index Scan Backward using device_messages_pkey on public.device_messages playerstat0_  (cost=0.57..5759855.56 rows=342650 width=34) (actual time=2.597..42.834 rows=300 loops=1)
        Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
        Filter: ((playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone) AND ((playerstat0_.mac_address)::text = '0011E004CA34'::text))
        Rows Removed by Filter: 154833
        Buffers: shared hit=15447 read=1676
Planning time: 0.180 ms
Execution time: 42.878 ms

EDIT 3: After more testing, it is confirmed that the cause is count(). The select with LIMIT and OFFSET is pretty fast. The count() alone can take up to a minute, as mentioned here: postgresql slow counting

While the count estimate function works (it reads ROWS from the query plan), I couldn't call it from JPA.
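
One possible workaround, sketched under assumptions: if the text output of a plain EXPLAIN for the select (not the count(), whose top node is an Aggregate with rows=1) can be fetched on the Java side, for example through a native query, the estimate can be parsed there instead of in a database function. How the plan text is obtained is not shown here, and the class and method names are hypothetical. Note that the result is the planner's estimate, not an exact count: in the plan above it is 342650 while the actual row count is 259242.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the planner's row estimate out of EXPLAIN text output. */
final class PlanRowEstimate {
    // Matches the estimate of a plan node, e.g. "(cost=11165.36..760309.47 rows=342650 width=34)".
    // The "(actual time=... rows=...)" part of EXPLAIN ANALYZE output is deliberately not matched.
    private static final Pattern ESTIMATE = Pattern.compile(
            "\\(cost=\\d+\\.\\d+\\.\\.\\d+\\.\\d+ rows=(\\d+) width=\\d+\\)");

    private PlanRowEstimate() {}

    /** Returns the estimated row count of the first (top) plan node, or empty if none is found. */
    static OptionalLong fromPlan(String explainOutput) {
        Matcher m = ESTIMATE.matcher(explainOutput);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }
}
```

Because only the first match is used, the value comes from the top node of the plan, which is the estimate for the whole query.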

EDIT 4: I've kind of solved the problem, but not completely.

About the select: after creating an index that matches the query, it runs quite fast, 2~5 seconds. But that is without sorting; sorting adds another processing step to the query.

The count() is slow, as the PostgreSQL documentation confirms. Because of MVCC, count() has to visit the heap to check row visibility, which is similar to a sequential scan of the whole table.

The final problem, which I am still not sure about, is that the query on the production server is much slower than on the testing server: 60 seconds on production versus 5 seconds on testing, with the same table size and data. The big difference is that the production server handles 20+ insert operations per second, while the testing server has none. I am guessing the inserts need a write lock, so the query is slow because it has to wait for the lock?

  • 1. How many records are you getting? 2. How did you get the "10x" more time? I suspect you are fetching a large amount of data, and you are comparing the actual retrieval in JPA against just the SQL execution (without fetching all the data) in some SQL client. Commented May 9, 2018 at 10:38
  • I'm not familiar with PG. One reason for not using the timestamp index may be the type of index. Is it a B-tree or a hash index? Other DBMSs may also decide which index to use based on a cost calculated from data statistics. That may be the reason. Commented May 9, 2018 at 10:40
  • @AdrianShum I guess you are right, pgAdmin may not be reading all records until I scroll down, which is why it is much faster. I am using the default B-tree index. Commented May 9, 2018 at 11:15
  • Please edit your question and add the execution plans generated using explain (analyze, buffers). Formatted text please, no screenshots. Commented May 9, 2018 at 11:23
  • Which indexes exactly do you have on that table? Is that a combined index on (mac_address, timestamp) or two single-column indexes? And how many rows does the query return? Maybe it's a simple case of JPA taking too long to process a large result. Commented May 9, 2018 at 11:24

1 Answer

You should be able to get better performance with a single index that covers both mac_address and timestamp:

CREATE INDEX [CONCURRENTLY] ON table (mac_address, timestamp);

The timestamp index is not used because PostgreSQL would have to cross-reference it with the mac_address index to find the matching rows, which would actually take longer than just looking up the rows directly.

I have no experience with JPA, so I can't really say why it's slower.
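
To illustrate why one two-column index helps, here is a toy model in Java. It is only an analogy and not how PostgreSQL implements its B-tree: entries are kept sorted by (mac_address, timestamp), so all rows for one MAC address within a time range sit next to each other and can be read as a single contiguous slice. The class, its fields, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy model of a composite (mac_address, timestamp) index; not PostgreSQL's implementation. */
final class CompositeIndexSketch {
    /** One index entry: the two key columns plus the id of the row they point to. */
    static final class Entry {
        final String mac;
        final long ts;
        final int rowId;

        Entry(String mac, long ts, int rowId) {
            this.mac = mac;
            this.ts = ts;
            this.rowId = rowId;
        }
    }

    // Sort by mac first, then by timestamp; rowId only breaks ties between equal keys.
    private static final Comparator<Entry> KEY_ORDER =
            Comparator.comparing((Entry e) -> e.mac)
                    .thenComparingLong(e -> e.ts)
                    .thenComparingInt(e -> e.rowId);

    private final NavigableSet<Entry> entries = new TreeSet<>(KEY_ORDER);

    void add(String mac, long ts, int rowId) {
        entries.add(new Entry(mac, ts, rowId));
    }

    /** Row ids for "mac = ? AND ts BETWEEN from AND to", read as one contiguous range. */
    List<Integer> lookup(String mac, long from, long to) {
        Entry low = new Entry(mac, from, Integer.MIN_VALUE);
        Entry high = new Entry(mac, to, Integer.MAX_VALUE);
        List<Integer> rowIds = new ArrayList<>();
        for (Entry e : entries.subSet(low, true, high, true)) {
            rowIds.add(e.rowId);
        }
        return rowIds;
    }
}
```

With two separate single-column structures, the same lookup would need one result set per column and then an intersection of the two, which is the cross-referencing described above.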
