
I have a PostgreSQL database containing ~160,000,000 rows. The database is basically a set of network logs. There is a column called time, which is a timestamp. For every unique time there is a log entry for each machine on the network, so there are multiple rows with the same timestamp, e.g.

time    ip             value
2:00    192.168.1.1    5
2:00    192.168.1.2    4
2:00    192.168.1.3    5
3:00    192.168.1.1    3
3:00    192.168.1.2    5
3:00    192.168.1.3    2
etc     etc

Now I have created an index on this table on time, as this is the field I always query against.

i.e. select * from networklogs where time = '2:00' and value = 5

Without the index, every query takes approximately 2 minutes because it has to scan the whole table. I created the index so that a query would first find the subset of rows with a matching time and then filter that subset on another column, and thus I hoped the query would be a lot faster.

However, the queries seem to take longer now...

This is strange, because in pgAdmin the explain output shows the query using the time index: the index narrows the data to ~800,000 rows out of 160,000,000, and the filter then narrows that down to 1,700 rows. The explanation reports 17 ms, yet if I actually run the query it takes 3 minutes.

This should be faster than testing two criteria against all 160,000,000 rows!

I cannot work out why it isn't faster. Even though pgAdmin's explanation says the query will use the time index, does it actually use it when the query is performed?

Does anyone have any ideas or suggestions?

Thanks, James

UPDATE: I have run an EXPLAIN ANALYZE on the following command:

explain analyze select latitude,longitude from networklogs where value = 5 and time = '2:00'

The results are:

"Bitmap Heap Scan on networklogs  (cost=13178.17..1488620.94 rows=848 width=19) (actual time=3507.923..278760.509 rows=33 loops=1)"
"  Recheck Cond: (time = '2:00'::timestamp without time zone)"
"  Rows Removed by Index Recheck: 38302021"
"  Filter: (value = 5)"
"  Rows Removed by Filter: 882873"
"  ->  Bitmap Index Scan on timeindex  (cost=0.00..13177.95 rows=530111 width=0) (actual time=695.771..695.771 rows=882906 loops=1)"
"        Index Cond: (time = '2:00'::timestamp without time zone)"
"Total runtime: 278764.148 ms"
  • Do you have the privilege to back up data older than a certain point in time and keep only a fixed amount of recent data? Commented Dec 3, 2012 at 16:03
  • It's actually slower now than before (I tried a couple of queries for comparison) :P I'm going to try dbrobins' suggestion. Commented Dec 3, 2012 at 16:04
  • Additionally, PostgreSQL can use subsets of multicolumn indexes (particularly the leftmost columns). Thus, if you always restrict by time and sometimes also by value, an index on (time,value) would be useful for both queries. Commented Dec 3, 2012 at 16:04
  • Can you please show the CREATE INDEX statement you used ? Commented Dec 3, 2012 at 16:12
  • @bonCodigo Actually I am not working in real time, so I can't edit the logs. I am trying to create software to analyze the past logs. Commented Dec 3, 2012 at 16:21
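The leftmost-prefix point in the comments above can be sketched as follows (the index name is illustrative):

```sql
-- One composite index serves both query shapes, because PostgreSQL can use
-- the leftmost column(s) of a multicolumn index on their own.
CREATE INDEX networklogs_time_value ON networklogs (time, value);

-- Can use the full (time, value) index:
SELECT * FROM networklogs WHERE time = '2:00' AND value = 5;

-- Can use just the leftmost column of the same index:
SELECT * FROM networklogs WHERE time = '2:00';
```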

2 Answers


If these are the parameters you will mostly search against, I would advise creating a multicolumn index on them. In your current situation the database has an index on time, so it knows which rows match that condition, but it then has to fetch those rows from all the different places the index points to and check the second condition row by row.

CREATE INDEX time_and_value ON networklogs (time, value);

Using this index, the database will not have to fetch rows just to check the second condition; it will fetch only the rows it already knows match both search criteria. Column order is important, of course.

I see you are storing latitude and longitude data; the point type might be a good fit for that, since it would let you use all the geometric operators that come with Postgres by default. This kind of data can be indexed as well.
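A sketch of what that could look like; the column and index names here are illustrative, and the migration assumes the existing latitude/longitude columns are numeric:

```sql
-- Store coordinates as a point and index it with GiST.
ALTER TABLE networklogs ADD COLUMN coord point;
UPDATE networklogs SET coord = point(longitude, latitude);
CREATE INDEX networklogs_coord ON networklogs USING gist (coord);

-- Example: find logs inside a bounding box with the containment operator,
-- which the GiST index can accelerate.
SELECT * FROM networklogs WHERE coord <@ box '((0,50),(10,60))';
```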


7 Comments

Looking at the EXPLAIN output, the time index matches 882,906 rows in 0.7 seconds. The database then takes another 27.1 seconds to fetch all those rows (presumably because of lots of disk seeks), only to discard 99.996% of them. You definitely need an index on both columns.
I'm sorry, that's 278 seconds (3.16 ms/row), not 27 seconds. Sounds like random reads to me.
Thanks for the answers. I would therefore need to create a few multicolumn indexes; is this possible? My actual table has more columns than the example above, so I would need indexes like time & value, time & flag, time & noconn...
Yes. You can create as many indexes as you want, provided you have enough disk space and that your database server can keep up with the extra work on insert/update/delete operations.
Thanks, I will modify the database to add these indexes. And thanks for the explanation; I understand how the indexing system works a lot better now. @greg For my solution I am not allowed to modify the data types, but I will bear this in mind for the future.
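The set of indexes discussed in the comments above could be sketched like this (the flag and noconn columns come from the comment, and the index names are illustrative):

```sql
-- One composite index per query pattern; each also covers time-only queries
-- via its leftmost column. Every extra index costs disk space and slows
-- insert/update/delete a little.
CREATE INDEX networklogs_time_value  ON networklogs (time, value);
CREATE INDEX networklogs_time_flag   ON networklogs (time, flag);
CREATE INDEX networklogs_time_noconn ON networklogs (time, noconn);
```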

PostgreSQL has EXPLAIN and EXPLAIN ANALYZE to examine how a query is run. pgAdmin likely uses plain EXPLAIN to tell you how the query will run; you may get more accurate results using EXPLAIN ANALYZE, which actually executes the query and reports statistics from the real run. At the very least you could see where the time was spent, even if the indexes are being used as expected.
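For reference, the variants look like this; the BUFFERS option (PostgreSQL 9.0+) additionally reports how many pages came from cache versus disk:

```sql
-- Plan only: estimated costs, query is not executed.
EXPLAIN SELECT * FROM networklogs WHERE time = '2:00' AND value = 5;

-- Actually runs the query and reports real row counts and timings.
EXPLAIN ANALYZE SELECT * FROM networklogs WHERE time = '2:00' AND value = 5;

-- Also shows buffer (page) usage, useful for spotting heavy disk I/O.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM networklogs WHERE time = '2:00' AND value = 5;
```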

2 Comments

I ran the EXPLAIN ANALYZE, but could you please explain to me what this means: "Bitmap Heap Scan on networklogs (cost=13178.17..1488620.94 rows=848 width=19) (actual time=3507.923..278760.509 rows=33 loops=1)" in terms of the time taken and the action performed? This is the first line of the output, by the way.
A bitmap heap scan fetches only the table pages the index identified as containing matching rows, visiting them in physical order, but it can still read a lot of pages. As another answer points out, you would benefit from adding value to the index.
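One more clue in your plan: the large "Rows Removed by Index Recheck: 38302021" suggests the bitmap outgrew work_mem and went lossy, so whole pages had to be rechecked row by row. A sketch of testing that hypothesis per session (the 256MB value is purely illustrative):

```sql
-- If the bitmap fits in work_mem it stays exact and the recheck count drops.
SET work_mem = '256MB';  -- illustrative value, tune for your machine

EXPLAIN ANALYZE
SELECT latitude, longitude
FROM networklogs
WHERE value = 5 AND time = '2:00';
```

This only mitigates the symptom; the composite index in the other answer is the real fix.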
