
I have a simple Postgres table, and a simple query to count the total records takes ages. I have 7.5 million records in the table, and I'm using a machine with 8 vCPUs and 32 GB of memory. The database is on the same machine.

Edit: added the query.

The following query is very slow:

SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000

Output of explain

$ explain SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000

                                                QUERY PLAN                                                 
---------------------------------------------------------------------------------------------------------
 Limit  (cost=5.42..49915.17 rows=10000 width=1985)
   ->  Index Scan using import_csv_id_idx on import_csv  (cost=0.43..19144730.02 rows=3835870 width=1985)
         Filter: (NOT processed)
(3 rows)

My table is as below:

      Column       |      Type      | Collation | Nullable | Default 
-------------------+----------------+-----------+----------+---------
 id                | integer        |           |          | 
 name              | character(500) |           |          | 
 domain            | character(500) |           |          | 
 year_founded      | real           |           |          | 
 industry          | character(500) |           |          | 
 size_range        | character(500) |           |          | 
 locality          | character(500) |           |          | 
 country           | character(500) |           |          | 
 linkedinurl       | character(500) |           |          | 
 employees         | integer        |           |          | 
 processed         | boolean        |           | not null | false
 employee_estimate | integer        |           |          | 
Indexes:
    "import_csv_id_idx" btree (id)
    "processed_idx" btree (processed)

Thank you

Edit 3:

# explain analyze SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000;
                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5.42..49915.33 rows=10000 width=1985) (actual time=8331.070..8355.556 rows=10000 loops=1)
   ->  Index Scan using import_csv_id_idx on import_csv  (cost=0.43..19144790.06 rows=3835870 width=1985) (actual time=8331.067..8354.874 rows=10001 loops=1)
         Filter: (NOT processed)
         Rows Removed by Filter: 3482252
 Planning time: 0.081 ms
 Execution time: 8355.925 ms
(6 rows)

Output of explain (analyze, buffers):

# explain (analyze, buffers) SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000;


                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5.42..49915.33 rows=10000 width=1985) (actual time=8236.899..8260.941 rows=10000 loops=1)
   Buffers: shared hit=724036 read=2187905 dirtied=17 written=35
   ->  Index Scan using import_csv_id_idx on import_csv  (cost=0.43..19144790.06 rows=3835870 width=1985) (actual time=8236.896..8260.104 rows=10001 loops=1)
         Filter: (NOT processed)
         Rows Removed by Filter: 3482252
         Buffers: shared hit=724036 read=2187905 dirtied=17 written=35
 Planning time: 0.386 ms
 Execution time: 8261.406 ms
(8 rows)
  • Please edit your question and add the execution plan generated using explain (analyze, buffers, format text), not just a "simple" explain. Commented Apr 23, 2020 at 9:57
  • OK, but a query like this is also very slow: SELECT * FROM import_csv WHERE processed = False ORDER BY id ASC OFFSET 1 LIMIT 10000 Commented Apr 23, 2020 at 10:28
  • Sorry a_horse_with_no_name, I edited the question; I had actually removed the explain part. Commented Apr 23, 2020 at 10:31
  • The execution plan is important. Please add the one generated using explain (analyze, buffers), which will contain more information than a "simple" explain. Commented Apr 23, 2020 at 10:37
  • But that's the output of a "simple" explain, not the output of explain (analyze, buffers). Commented Apr 23, 2020 at 14:03

1 Answer


It is slow because it has to dig through 3482252 rows that fail the processed = False criterion before finding the 10001st one that passes (OFFSET 1 plus LIMIT 10000 means 10001 matching rows are needed), and apparently all those failing rows are scattered randomly around the table, leading to a lot of slow random I/O.
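
One way to sanity-check the "scattered randomly" claim (this query is my suggestion, not part of the original answer): pg_stats reports how well each column's values correlate with the physical row order, and a correlation near zero for id means an index scan on id hits a different heap page for almost every row.

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'import_csv' AND attname = 'id';

The buffers output above points the same way: roughly 2.9 million pages (shared hit + read) were touched while scanning about 3.5 million rows, i.e. nearly one page per row.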

You need either an index on (processed, id), or a partial index on (id) WHERE processed = false.
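
A minimal sketch of both options (the index names here are my own; PostgreSQL doesn't care what you call them):

CREATE INDEX import_csv_processed_id_idx ON import_csv (processed, id);

-- or the partial variant, which is smaller because it contains only the unprocessed rows:
CREATE INDEX import_csv_unprocessed_id_idx ON import_csv (id) WHERE processed = false;

Either one lets the planner read only the rows with processed = false, already in id order, instead of walking the full id index and filtering rows out.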

If you do the first of these, you can drop the index on processed alone, as it would no longer be independently useful (if it ever were to start with).
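
For example, after creating the (processed, id) index, using the index name from the table definition above:

DROP INDEX processed_idx;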

1 Comment

Thank you. I wonder, though, what the reason is for indexing (processed, id); I just want to know more.
