PostgreSQL/Psycopg2 indexing array in the size of thousands

Question

I will have arrays of 16,380 elements each, and I want to index them so that I can run aggregate functions over any of the 16,380 positions.

For example, I have a table:

+----+-------+----------------------------+
| id | stuff |   large_array_of_values    |
+----+-------+----------------------------+
|  0 |     5 | {1.0, NULL, 3.0, 4.0, ...} |
|  1 |     2 | {2.0, 3.0, 4.0, 5.0, ...}  |
|  2 |     1 | {3.0, 4.0, 5.0, NULL, ...} |
+----+-------+----------------------------+

I want to select the values at one specific index of the large_array_of_values column, say index 15,123, and apply an aggregate function such as population standard deviation across that index for every row in the table.

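E.g., in the above table the population standard deviation across index 1 of each array (NULL, 3.0, 4.0) should be 0.5, since PostgreSQL aggregates ignore NULL values. (I am counting from zero here; PostgreSQL array subscripts start at 1 by default, so this is element [2] in SQL.)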
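A quick way to sanity-check that expected value outside the database is sketched below in Python (the language Psycopg2 is used from). It assumes the index in the example is zero-based, as the (NULL, 3.0, 4.0) triple implies, whereas PostgreSQL array subscripts start at 1 by default. The helper names `column_at` and `stddev_pop_query` and the table name `t` are illustrative assumptions, not part of the original schema.

```python
import statistics

# The three sample rows from the table above; None stands in for SQL NULL.
rows = [
    [1.0, None, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, None],
]


def column_at(arrays, zero_based_index):
    """Collect the non-NULL values found at one position of every array."""
    return [a[zero_based_index] for a in arrays if a[zero_based_index] is not None]


def stddev_pop_query(zero_based_index):
    """Build (sql, params) for a DB-API cursor.execute() call.

    Assumptions: the table is named t (hypothetical) and the arrays use
    PostgreSQL's default lower bound of 1, so the zero-based position is
    shifted by one before it is sent as the subscript.
    """
    sql = "SELECT stddev_pop(large_array_of_values[%s]) FROM t"
    return sql, (zero_based_index + 1,)


values = column_at(rows, 1)               # [3.0, 4.0]; the NULL is skipped
local_result = statistics.pstdev(values)  # population standard deviation
sql, params = stddev_pop_query(1)

print(local_result)  # 0.5
print(sql, params)
```

With Psycopg2, something like `cur.execute(*stddev_pop_query(15123))` should then ask the server for subscript 15124, leaving the NULL handling to `stddev_pop` itself.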

My main question is: how efficient is it to index such a large number of values in PostgreSQL?

I think your kind of need is best served by a columnar database.
– joanolo, Commented Mar 21, 2017 at 22:41

joanolo · Accepted Answer · 2017-03-22 12:23:25Z

I don't think indexes will help. I am not sure why, so what follows is just an experiment...

First, we create a table with your structure and fill it with random data (10,000 rows, each with a vector of 1,000 elements).

 CREATE TABLE t
 (
     id integer /* PRIMARY KEY */, 
     stuff integer, 
     large_array_of_values float[]
 ) ;

 CREATE OR REPLACE FUNCTION random_vector() RETURNS float[] AS
 $$
     select 
        array_agg(random())
     from 
        generate_series (1, 1000)
 $$
 LANGUAGE SQL ;

 INSERT INTO t
    (id, stuff, large_array_of_values)
 SELECT
    id, random()*10000,  random_vector()
 FROM
        generate_series(1, 10000) AS i(id) ;

At this point we create one sample index for values at index [32] of the vector (plus the id!):

 CREATE INDEX 
    idx_32 ON t(id, (large_array_of_values[32]));

Now, we ask PostgreSQL to analyze the following query and explain it:

 EXPLAIN ANALYZE
 SELECT
    avg(large_array_of_values[32])
 FROM
    t
 WHERE
    id BETWEEN 5000 and 7500 
     AND (large_array_of_values[32]) > 0.32 ;

 | QUERY PLAN                                                                                                              |
 | :---------------------------------------------------------------------------------------------------------------------- |
 | Aggregate  (cost=46.94..46.95 rows=1 width=8) (actual time=54.871..54.871 rows=1 loops=1)                               |
 |   ->  Bitmap Heap Scan on t  (cost=4.91..46.89 rows=17 width=32) (actual time=0.392..1.204 rows=1732 loops=1)           |
 |         Recheck Cond: ((id >= 5000) AND (id <= 7500) AND (large_array_of_values[32] > '0.32'::double precision))        |
 |         Heap Blocks: exact=20                                                                                           |
 |         ->  Bitmap Index Scan on idx_32  (cost=0.00..4.91 rows=17 width=0) (actual time=0.364..0.364 rows=1732 loops=1) |
 |               Index Cond: ((id >= 5000) AND (id <= 7500) AND (large_array_of_values[32] > '0.32'::double precision))    |
 | Planning time: 0.405 ms                                                                                                 |
 | Execution time: 55.013 ms                                                                                               |

dbfiddle here

The plan does use idx_32, but only for a bitmap index scan followed by a bitmap heap scan. It does not perform an index-only scan, which I guess is what you wanted, even though idx_32 is a covering index for this query.

Running VACUUM to make sure that the visibility map was up to date didn't have any effect. I couldn't find any explicit restriction saying that index-only scans must refer to plain columns (and not to indexes on expressions), but it appears that they aren't used with an expression index like this one.
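If you want to repeat this check from Python rather than by eye, a rough sketch follows. It only looks for the node label in the text lines that EXPLAIN returns (one line per row when fetched through a DB-API cursor). The function name `uses_index_only_scan` is made up for illustration, and the sample lines are abbreviated from the plan above.

```python
def uses_index_only_scan(plan_lines):
    """Return True if any EXPLAIN text line names an Index Only Scan node."""
    return any("Index Only Scan" in line for line in plan_lines)


# Node lines abbreviated from the plan above (cost and timing details dropped).
expression_index_plan = [
    "Aggregate",
    "  ->  Bitmap Heap Scan on t",
    "        Heap Blocks: exact=20",
    "        ->  Bitmap Index Scan on idx_32",
]

print(uses_index_only_scan(expression_index_plan))  # False
```

This is a plain substring test, so it says nothing about cost or timing; it only tells you which kind of scan node the planner picked.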

Comparison with a non-vector column

 CREATE TABLE t
 (
     id integer /* PRIMARY KEY */, 
     stuff integer, 
     a_value float
 ) ;

 INSERT INTO t
    (id, stuff, a_value)
 SELECT
    id, random()*10000,  random()
 FROM
    generate_series(1, 10000) AS i(id) ;

 CREATE INDEX idx_value ON t(id, a_value);

 VACUUM ANALYZE VERBOSE t ;

In this case, the covering index is actually used, and you get an "index-only scan".

 EXPLAIN ANALYZE
 SELECT
    avg(a_value)
 FROM
    t
 WHERE
    id BETWEEN 5000 AND 7500 AND (a_value > 0.1) ;

 | QUERY PLAN                                                                                                                    |
 | :---------------------------------------------------------------------------------------------------------------------------- |
 | Aggregate  (cost=103.67..103.69 rows=1 width=8) (actual time=1.139..1.140 rows=1 loops=1)                                     |
 |   ->  Index Only Scan using idx_value on t  (cost=0.29..98.05 rows=2251 width=8) (actual time=0.026..0.655 rows=2247 loops=1) |
 |         Index Cond: ((id >= 5000) AND (id <= 7500) AND (a_value > '0.1'::double precision))                                   |
 |         Heap Fetches: 0                                                                                                       |
 | Planning time: 0.184 ms                                                                                                       |
 | Execution time: 1.179 ms                                                                                                      |

dbfiddle here
