
I have a view in my app that visualizes a lot of data, and in the backend the data is produced using this query:

DataPoint Load (20394.8ms)  
SELECT communities.id as com, 
       consumers.name as con, 
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, 
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons 
FROM "data_points" 
     INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" 
     INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" 
     INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" 
     INNER JOIN "clusterings" ON "clusterings"."id" = "communities"."clustering_id" 
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) 
   AND "data_points"."interval_id" = $3 
   AND "clusterings"."id" = 1 
GROUP BY communities.id, consumers.id  
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]

The query takes about 20 seconds to execute, which seems a bit excessive.

The code for generating the query is this:

res = {}
DataPoint.joins(consumer: { communities: :clustering })
         .where('clusterings.id': self,
                timestamp: chart_cookies[:start_date]..chart_cookies[:end_date],
                interval_id: chart_cookies[:interval_id])
         .group('communities.id')
         .group('consumers.id')
         .select('communities.id as com, consumers.name as con',
                 'array_agg(timestamp ORDER BY data_points.timestamp asc) as tims',
                 'array_agg(consumption ORDER BY data_points.timestamp ASC) as cons')
         .each do |d|
  res[d.com] ||= {}
  res[d.com][d.con] = d.tims.zip(d.cons)
  res[d.com]["aggregate"] ||= d.tims.map { |t| [t, 0] }
  res[d.com]["aggregate"]   = res[d.com]["aggregate"].zip(d.cons).map { |(a, b), c| [a, b + c] }
end
res

And the relevant database models are the following:

  create_table "data_points", force: :cascade do |t|
    t.bigint "consumer_id"
    t.bigint "interval_id"
    t.datetime "timestamp"
    t.float "consumption"
    t.float "flexibility"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.index ["consumer_id"], name: "index_data_points_on_consumer_id"
    t.index ["interval_id"], name: "index_data_points_on_interval_id"
    t.index ["timestamp", "consumer_id", "interval_id"], name: "index_data_points_on_timestamp_and_consumer_id_and_interval_id", unique: true
    t.index ["timestamp"], name: "index_data_points_on_timestamp"
  end

  create_table "consumers", force: :cascade do |t|
    t.string "name"
    t.string "location"
    t.string "edms_id"
    t.bigint "building_type_id"
    t.bigint "connection_type_id"
    t.float "location_x"
    t.float "location_y"
    t.string "feeder_id"
    t.bigint "consumer_category_id"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.index ["building_type_id"], name: "index_consumers_on_building_type_id"
    t.index ["connection_type_id"], name: "index_consumers_on_connection_type_id"
    t.index ["consumer_category_id"], name: "index_consumers_on_consumer_category_id"
  end

  create_table "communities_consumers", id: false, force: :cascade do |t|
    t.bigint "consumer_id", null: false
    t.bigint "community_id", null: false
    t.index ["community_id", "consumer_id"], name: "index_communities_consumers_on_community_id_and_consumer_id"
    t.index ["consumer_id", "community_id"], name: "index_communities_consumers_on_consumer_id_and_community_id"
  end

  create_table "communities", force: :cascade do |t|
    t.string "name"
    t.text "description"
    t.bigint "clustering_id"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.index ["clustering_id"], name: "index_communities_on_clustering_id"
  end

  create_table "clusterings", force: :cascade do |t|
    t.string "name"
    t.text "description"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
  end

How can I make the query execute faster? Is it possible to refactor the query to simplify it, or to add an extra index to the database schema, so that it runs in less time?

Interestingly, a slightly simplified version of the query, which I use in another view, runs much faster, in only 1161.4ms for the first request and 41.6ms for the following requests:

DataPoint Load (1161.4ms)  
SELECT consumers.name as con, 
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, 
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons 
FROM "data_points" 
    INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" 
    INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" 
    INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" 
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) 
   AND "data_points"."interval_id" = $3 
   AND "communities"."id" = 100 GROUP BY communities.id, consumers.name  
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]

Running EXPLAIN (ANALYZE, BUFFERS) on the query in the dbconsole, I get the following output:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=12.31..7440.69 rows=246 width=57) (actual time=44.139..20474.015 rows=296 loops=1)
   Group Key: communities.id, consumers.id
   Buffers: shared hit=159692 read=6148105 written=209
   ->  Nested Loop  (cost=12.31..7434.54 rows=246 width=57) (actual time=20.944..20436.806 rows=49728 loops=1)
         Buffers: shared hit=159685 read=6148105 written=209
         ->  Nested Loop  (cost=11.88..49.30 rows=1 width=49) (actual time=0.102..6.374 rows=296 loops=1)
               Buffers: shared hit=988 read=208
               ->  Nested Loop  (cost=11.73..41.12 rows=1 width=57) (actual time=0.084..4.443 rows=296 loops=1)
                     Buffers: shared hit=396 read=208
                     ->  Merge Join  (cost=11.58..40.78 rows=1 width=24) (actual time=0.075..1.365 rows=296 loops=1)
                           Merge Cond: (communities_consumers.community_id = communities.id)
                           Buffers: shared hit=5 read=7
                           ->  Index Only Scan using index_communities_consumers_on_community_id_and_consumer_id on communities_consumers  (cost=0.27..28.71 rows=296 width=16) (actual time=0.039..0.446 rows=296 loops=1)
                                 Heap Fetches: 4
                                 Buffers: shared hit=1 read=6
                           ->  Sort  (cost=11.31..11.31 rows=3 width=16) (actual time=0.034..0.213 rows=247 loops=1)
                                 Sort Key: communities.id
                                 Sort Method: quicksort  Memory: 25kB
                                 Buffers: shared hit=4 read=1
                                 ->  Bitmap Heap Scan on communities  (cost=4.17..11.28 rows=3 width=16) (actual time=0.026..0.027 rows=6 loops=1)
                                       Recheck Cond: (clustering_id = 1)
                                       Heap Blocks: exact=1
                                       Buffers: shared hit=4 read=1
                                       ->  Bitmap Index Scan on index_communities_on_clustering_id  (cost=0.00..4.17 rows=3 width=0) (actual time=0.020..0.020 rows=8 loops=1)
                                             Index Cond: (clustering_id = 1)
                                             Buffers: shared hit=3 read=1
                     ->  Index Scan using consumers_pkey on consumers  (cost=0.15..0.33 rows=1 width=33) (actual time=0.007..0.008 rows=1 loops=296)
                           Index Cond: (id = communities_consumers.consumer_id)
                           Buffers: shared hit=391 read=201
               ->  Index Only Scan using clusterings_pkey on clusterings  (cost=0.15..8.17 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=296)
                     Index Cond: (id = 1)
                     Heap Fetches: 296
                     Buffers: shared hit=592
         ->  Index Scan using index_data_points_on_consumer_id on data_points  (cost=0.44..7383.44 rows=180 width=24) (actual time=56.128..68.995 rows=168 loops=296)
               Index Cond: (consumer_id = consumers.id)
               Filter: (("timestamp" >= '2015-11-20 09:23:00'::timestamp without time zone) AND ("timestamp" <= '2015-11-27 09:23:00'::timestamp without time zone) AND (interval_id = 2))
               Rows Removed by Filter: 76610
               Buffers: shared hit=158697 read=6147897 written=209
 Planning time: 1.811 ms
 Execution time: 20474.330 ms
(40 rows)

The bullet gem returns the following warnings:

USE eager loading detected
  Community => [:communities_consumers]
  Add to your finder: :includes => [:communities_consumers]

USE eager loading detected
  Community => [:consumers]
  Add to your finder: :includes => [:consumers]

After removing the join with the clusterings table, the new query plan is the following:

EXPLAIN for: SELECT communities.id as com, consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 AND "communities"."clustering_id" = 1 GROUP BY communities.id, consumers.id [["timestamp", "2015-11-29 20:52:30.926247"], ["timestamp", "2015-12-06 20:52:30.926468"], ["interval_id", 2]]
                                                                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10839.79..10846.42 rows=241 width=57)
   ->  Sort  (cost=10839.79..10840.39 rows=241 width=57)
         Sort Key: communities.id, consumers.id
         ->  Nested Loop  (cost=7643.11..10830.26 rows=241 width=57)
               ->  Nested Loop  (cost=11.47..22.79 rows=1 width=49)
                     ->  Hash Join  (cost=11.32..17.40 rows=1 width=16)
                           Hash Cond: (communities_consumers.community_id = communities.id)
                           ->  Seq Scan on communities_consumers  (cost=0.00..4.96 rows=296 width=16)
                           ->  Hash  (cost=11.28..11.28 rows=3 width=8)
                                 ->  Bitmap Heap Scan on communities  (cost=4.17..11.28 rows=3 width=8)
                                       Recheck Cond: (clustering_id = 1)
                                       ->  Bitmap Index Scan on index_communities_on_clustering_id  (cost=0.00..4.17 rows=3 width=0)
                                             Index Cond: (clustering_id = 1)
                     ->  Index Scan using consumers_pkey on consumers  (cost=0.15..5.38 rows=1 width=33)
                           Index Cond: (id = communities_consumers.consumer_id)
               ->  Bitmap Heap Scan on data_points  (cost=7631.64..10805.72 rows=174 width=24)
                     Recheck Cond: ((consumer_id = consumers.id) AND ("timestamp" >= '2015-11-29 20:52:30.926247'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926468'::timestamp without time zone))
                     Filter: (interval_id = 2::bigint)
                     ->  BitmapAnd  (cost=7631.64..7631.64 rows=861 width=0)
                           ->  Bitmap Index Scan on index_data_points_on_consumer_id  (cost=0.00..1589.92 rows=76778 width=0)
                                 Index Cond: (consumer_id = consumers.id)
                           ->  Bitmap Index Scan on index_data_points_on_timestamp  (cost=0.00..6028.58 rows=254814 width=0)
                                 Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926247'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926468'::timestamp without time zone))
(23 rows)

As requested in the comments, these are the query plans for the simplified query, with and without the restriction on communities.id

 DataPoint Load (1563.3ms)  SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name  [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
EXPLAIN for: SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
                                                                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=140992.34..142405.51 rows=51388 width=49)
   ->  Sort  (cost=140992.34..141120.81 rows=51388 width=49)
         Sort Key: communities.id, consumers.name
         ->  Hash Join  (cost=10135.44..135214.45 rows=51388 width=49)
               Hash Cond: (data_points.consumer_id = consumers.id)
               ->  Bitmap Heap Scan on data_points  (cost=10082.58..134455.00 rows=51388 width=24)
                     Recheck Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
                     ->  Bitmap Index Scan on index_data_points_on_timestamp_and_consumer_id_and_interval_id  (cost=0.00..10069.74 rows=51388 width=0)
                           Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
               ->  Hash  (cost=49.16..49.16 rows=296 width=49)
                     ->  Hash Join  (cost=33.06..49.16 rows=296 width=49)
                           Hash Cond: (communities_consumers.community_id = communities.id)
                           ->  Hash Join  (cost=8.66..20.69 rows=296 width=49)
                                 Hash Cond: (consumers.id = communities_consumers.consumer_id)
                                 ->  Seq Scan on consumers  (cost=0.00..7.96 rows=296 width=33)
                                 ->  Hash  (cost=4.96..4.96 rows=296 width=16)
                                       ->  Seq Scan on communities_consumers  (cost=0.00..4.96 rows=296 width=16)
                           ->  Hash  (cost=16.40..16.40 rows=640 width=8)
                                 ->  Seq Scan on communities  (cost=0.00..16.40 rows=640 width=8)
(19 rows)

and

  DataPoint Load (1479.0ms)  SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name  [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
EXPLAIN for: SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
                                                                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=140992.34..142405.51 rows=51388 width=49)
   ->  Sort  (cost=140992.34..141120.81 rows=51388 width=49)
         Sort Key: communities.id, consumers.name
         ->  Hash Join  (cost=10135.44..135214.45 rows=51388 width=49)
               Hash Cond: (data_points.consumer_id = consumers.id)
               ->  Bitmap Heap Scan on data_points  (cost=10082.58..134455.00 rows=51388 width=24)
                     Recheck Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
                     ->  Bitmap Index Scan on index_data_points_on_timestamp_and_consumer_id_and_interval_id  (cost=0.00..10069.74 rows=51388 width=0)
                           Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
               ->  Hash  (cost=49.16..49.16 rows=296 width=49)
                     ->  Hash Join  (cost=33.06..49.16 rows=296 width=49)
                           Hash Cond: (communities_consumers.community_id = communities.id)
                           ->  Hash Join  (cost=8.66..20.69 rows=296 width=49)
                                 Hash Cond: (consumers.id = communities_consumers.consumer_id)
                                 ->  Seq Scan on consumers  (cost=0.00..7.96 rows=296 width=33)
                                 ->  Hash  (cost=4.96..4.96 rows=296 width=16)
                                       ->  Seq Scan on communities_consumers  (cost=0.00..4.96 rows=296 width=16)
                           ->  Hash  (cost=16.40..16.40 rows=640 width=8)
                                 ->  Seq Scan on communities  (cost=0.00..16.40 rows=640 width=8)
(19 rows)
  • Please use the bullet gem (or run the query as raw SQL in your SQL client) to see where your code spends its time. From the code, it looks like the each block will get heavy; if your query returns a big list, processing that list in Rails is slow. Commented Nov 27, 2017 at 10:57
  • Is this query doing some sort of analysis on your operational system data? Commented Nov 27, 2017 at 11:06
  • @Jin.X: I added the output of EXPLAIN (ANALYZE, BUFFERS) on the query to the question. I don't think that Ruby is the bottleneck, because it doesn't show high CPU load while the query is running. Commented Nov 27, 2017 at 11:37
  • @xeon131: Not sure what you mean by "operational system data", but the project is about clustering in energy systems, and this query depicts an overview of the consumption data for the current clustering, broken down into communities, if that makes sense. Commented Nov 27, 2017 at 11:40
  • @user000001 Use the bullet gem and run the query from the Rails server; your server log will show which queries are performed in the background and how long each takes. That will help reviewers (and yourself) find where the problem lies. Commented Nov 27, 2017 at 15:55

6 Answers

3
+50

Did you try adding an index on:

data_points.timestamp + data_points.consumer_id

OR

data_points.consumer_id only?


6 Comments

This answer actually helped the most, bringing the query time down to 8 seconds.
Hmm, interesting. You already have a composite index on "timestamp", "consumer_id", "interval_id"; shouldn't an index on only data_points.timestamp + data_points.consumer_id be redundant?
Since the query only uses 2 of these 3 fields, why would that be? Also, sorry to hear that it didn't go faster than that @user000001 :/
@nekogami: Yes, I was surprised too that the two-column index is needed when the three-column index already exists, but it seems that it is in fact necessary.
@user000001 So, did you find a solution?
3

What version of Postgres are you using? Postgres 10 introduced native (declarative) table partitioning. If your "data_points" table is very large, this may significantly speed up your query, since you are filtering on a time range:

WHERE (data_points.TIMESTAMP BETWEEN $1 AND $2) 

One strategy you can look into is to add partitioning on the DATE value of the "timestamp" field. Then modify your query to include an extra filter so the partitioning kicks in:

WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) 
   AND (CAST("data_points"."timestamp" AS DATE) BETWEEN CAST($1 AS DATE) AND CAST($2 AS DATE))
   AND "data_points"."interval_id" = $3 
   AND "data_points"."interval_id" = $3 
   AND "communities"."clustering_id"  = 1 

If your "data_points" table is very large and your "Timestamp" filtering range is small, this should help, since it would quickly filter out blocks of rows that don't need to be processed.

I haven't done this in Postgres, so I'm not sure how feasible, helpful, blah blah blah, it is. But it's something to look into :)

https://www.postgresql.org/docs/10/static/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE
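
As an illustration only (nothing below comes from the answer), a Postgres 10 declarative-partitioning setup could look roughly like this sketch. The data_points_partitioned table, the partition names, and the migration class are made up; in practice the existing data would still have to be copied over, and per-partition indexes would have to be created separately, since Postgres 10 does not create them on the parent:

# Hypothetical sketch of Postgres 10 declarative partitioning by timestamp range.
class CreatePartitionedDataPoints < ActiveRecord::Migration[5.1]
  def up
    execute <<~SQL
      -- New parent table, partitioned by range on "timestamp"
      CREATE TABLE data_points_partitioned (
        id           bigserial,
        consumer_id  bigint,
        interval_id  bigint,
        "timestamp"  timestamp,
        consumption  float,
        flexibility  float,
        created_at   timestamp NOT NULL,
        updated_at   timestamp NOT NULL
      ) PARTITION BY RANGE ("timestamp");

      -- One partition per month; a query with a timestamp range only touches
      -- the partitions that overlap that range (partition pruning).
      CREATE TABLE data_points_y2015m11 PARTITION OF data_points_partitioned
        FOR VALUES FROM ('2015-11-01') TO ('2015-12-01');
      CREATE TABLE data_points_y2015m12 PARTITION OF data_points_partitioned
        FOR VALUES FROM ('2015-12-01') TO ('2016-01-01');
    SQL
  end

  def down
    execute "DROP TABLE data_points_partitioned CASCADE;"
  end
end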

1 Comment

This looks promising, I'll study the link and see how to modify the table structure to take advantage of partitioning.
2

Do you have a foreign key on clustering_id? Also, try altering your condition like this:

SELECT communities.id as com, 
       consumers.name as con, 
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, 
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons 
FROM "data_points" 
     INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" 
     INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" 
     INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" 
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) 
   AND "data_points"."interval_id" = $3 
   AND "communities"."clustering_id"  = 1 
GROUP BY communities.id, consumers.id 
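
In ActiveRecord terms the same condition could be expressed roughly as below. This is only a sketch based on the associations shown in the question, with the clustering id hard-coded to 1 as in the SQL above:

# Filter on communities.clustering_id directly, so clusterings never needs to be joined.
DataPoint.joins(consumer: :communities)
         .where(communities: { clustering_id: 1 },
                timestamp: chart_cookies[:start_date]..chart_cookies[:end_date],
                interval_id: chart_cookies[:interval_id])
         .group('communities.id', 'consumers.id')
         .select('communities.id AS com',
                 'consumers.name AS con',
                 'array_agg(timestamp ORDER BY data_points.timestamp ASC) AS tims',
                 'array_agg(consumption ORDER BY data_points.timestamp ASC) AS cons')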

1 Comment

This is similar to @EdmundLee's answer, but it doesn't improve the performance of the query.
2
  1. You don't need to join clusterings. Try removing that join from your query and use communities.clustering_id = 1 instead. This should get rid of three steps in your query plan, and it should save you the most, since your query plan is doing several index scans inside three nested loops.

  2. You can also try to tweak the way you aggregate timestamps. I assume you don't need to aggregate them at second-level granularity?

  3. I'd also remove the "index_data_points_on_timestamp" index, since you already have a composite index that covers it and on its own it is practically useless. Dropping it should improve your write performance (see the migration sketch after this list).
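
A rough sketch of point 3 as a migration (the class name and migration version are made up, not part of the original answer):

class RemoveTimestampIndexFromDataPoints < ActiveRecord::Migration[5.1]
  def change
    # The composite index on [timestamp, consumer_id, interval_id] already has
    # timestamp as its leading column, so the single-column index is redundant.
    remove_index :data_points, :timestamp
  end
end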

8 Comments

Point 1 makes sense, but I tried it and it didn't improve the performance. For point 2, I use the interval_id check, which limits the data_points to the specified interval (e.g. 15 min, hourly, daily).
@user000001 can you post the query planner result after you take clusterings out?
Sure, this is the query that is executed: DataPoint Load (39700.3ms) SELECT communities.id as com, consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons ...
... FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 AND "communities"."clustering_id" = 1 GROUP BY communities.id, consumers.id [["timestamp", "2015-11-29 08:00:50.371546"], ["timestamp", "2015-12-06 08:00:50.371951"], ["interval_id", 2]]
@user000001 sorry, I mean the planner from postgres. can you share that?
0

The index on data_points.timestamp is not being used, perhaps due to the ::timestamp conversion.

I wonder if altering the column datatype or creating a functional index would help.

EDIT - the datetime in your CREATE TABLE is how Rails chooses to display the Postgres timestamp data type, I guess, and so there may be no conversion taking place after all.

Nevertheless, the index on timestamp is not being used; depending on your data distribution, that could still be a smart choice by the optimizer.

Comments

0

So here we have Postgres 9.3 and a long-running query. Before tuning the query itself, you have to ensure that your database settings are optimal and suited to your read/write ratio and disk type (SSD or spinning disk), that autovacuum is not switched off, that your tables and indexes are not bloated, and that the indexes used for building plans have good selectivity.

Check the column types and how much of each row is actually filled; changing column types can also reduce the table size and the query time.

Once all of this is ensured, let's think about how Postgres executes the query and how we can reduce its time and effort. An ORM is good for simple queries, but for a complicated query like this you should execute raw SQL and keep it in a query service object.
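
As a sketch of what that could look like (the class name and interface below are assumptions, not from this answer), the raw SQL can live in a single query object and be run with find_by_sql:

# Hypothetical query object: keeps the raw SQL in one place and runs it
# directly, bypassing the ActiveRecord query builder.
class ConsumptionByCommunityQuery
  QUERY = <<~SQL.freeze
    SELECT communities.id AS com,
           consumers.name AS con,
           array_agg(timestamp ORDER BY data_points.timestamp ASC) AS tims,
           array_agg(consumption ORDER BY data_points.timestamp ASC) AS cons
    FROM data_points
         INNER JOIN consumers ON consumers.id = data_points.consumer_id
         INNER JOIN communities_consumers ON communities_consumers.consumer_id = consumers.id
         INNER JOIN communities ON communities.id = communities_consumers.community_id
    WHERE data_points."timestamp" BETWEEN :start_date AND :end_date
      AND data_points.interval_id = :interval_id
      AND communities.clustering_id = :clustering_id
    GROUP BY communities.id, consumers.id
  SQL

  def self.call(clustering_id:, start_date:, end_date:, interval_id:)
    DataPoint.find_by_sql([QUERY, { clustering_id: clustering_id,
                                    start_date: start_date,
                                    end_date: end_date,
                                    interval_id: interval_id }])
  end
end

It could then be called with the clustering id, date range, and interval id, returning DataPoint records that expose com, con, tims, and cons, just like the relation in the question.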

Write the SQL as simply as possible; Postgres also spends time parsing queries.

Check that all join columns are indexed, and use EXPLAIN ANALYZE to verify that you now get optimal scan methods.

Next point: you are doing 4 joins! Postgres has to search for the optimal query plan among roughly 4! (four factorial) join orders. Consider using subqueries or a predefined table for this selection.

1) Use a separate query or function for the 4 joins, keeping its result as predefined (try subqueries or a temporary table):

SELECT *
FROM "data_points"
     INNER JOIN "consumers"
             ON "consumers"."id" = "data_points"."consumer_id"
     INNER JOIN "communities_consumers"
             ON "communities_consumers"."consumer_id" = "consumers"."id"
     INNER JOIN "communities"
             ON "communities"."id" = "communities_consumers"."community_id"
     INNER JOIN "clusterings"
             ON "clusterings"."id" = "communities"."clustering_id"
WHERE "data_points"."interval_id" = 2
  AND "clusterings"."id" = 1

2) Next, apply the timestamp filter to that predefined result (pass the values directly instead of bind variables):

SELECT *
FROM predefined
WHERE "timestamp" BETWEEN '2015-11-20 09:23:00'
                      AND '2015-11-27 09:23:00'

3) You reference data_points three times in the query; you need fewer references:

array_agg(timestamp ORDER BY data_points.timestamp asc) as tims
array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)

Remember that a long-running query is not only about the query itself: it is also about the database settings, ORM usage, the SQL, and how Postgres works with all of it.

Comments
