How can I prevent my SQL subquery from caching data in PostgreSQL?

Question

I'm trying to insert test data to test postgresql partionning, using Postgresql 11.

Here is my table:

CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
);

Here is my statement to insert test data:

explain insert into measurement(
  city_id,
  logdate,
  peaktemp,
  unitsales
)
select
  ( 
    select (random() * 10000)::int + (generator*0) as city_id 
  ),
  ( 
    select 
        * 
    from 
        (select * from generate_series('2006-01-01'::timestamp,
                                       '2006-12-31'::timestamp,
                                       '1 day')) as rng
    order by
        random()
    limit 1
  ) as logdate,
  ( 
    select (random() * 10000)::int + (generator*0) as peaktemp 
  ),
  ( 
    select (random() * 10000)::int + (generator*0) as unitsales 
  )
from generate_series(1, 1000) as generator

The problem is: when I run this statement, the select subquery which generates the field logdate seems to be cached and the table measurement contains a single value for the field logdate.

So, if I run the following query:

select distinct(logdate)
from measurement

I only get a single value: 2006-02-10

How can I prevent the caching side-effect?

Your query seems to be gratuitously complicated, but if you don't want to rearrange it, you could just order by random()+generator to force the subselect to re-evaluated each time. — jjanes
– jjanes, Commented Oct 25, 2019 at 23:19

richyen · Accepted Answer · 2019-10-26 00:14:19Z

3

As you've observed, the ORDER BY is calculated once for the sub-query, such that for each row that's generated in your outer query, the selected row is going to be the same. As such, you won't be able to use sub-selects and ORDER BY to achieve what you're aiming for, even though you call random().

Try this:

SELECT
  (random() * 10000)::int AS city_id,
  '2006-01-01'::timestamp + random() * ('2006-12-31'::timestamp - '2006-01-01'::timestamp) AS logdate,
  (random() * 10000)::int AS peaktemp,
  (random() * 10000)::int AS unitsales
FROM
generate_series(1,1000);

I'm not sure what your requirements are, but I don't think you need to have 1) a bunch of sub-selects or 2) all the references to generator*0, so I've removed them too.

Disclosure: I work for EnterpriseDB (EDB)

edited Oct 26, 2019 at 0:14

answered Oct 25, 2019 at 20:26

richyen

10.3k4 gold badges18 silver badges36 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

MMacphail Over a year ago

Thanks for the tip. Concerning 1) and 2), I was following a generate_series guide.

Tometzky · Accepted Answer · 2019-10-25 20:28:37Z

1

Maybe something like this:

select
  (random() * 10000)::int as city_id,
  (
    '2006-01-01'::timestamptz
    +random()*(
      ('2006-01-01'::timestamptz+'1 year'::interval)
      -'2006-01-01'::timestamptz
    )
  ) as logdate,
  (random() * 10000)::int as peaktemp,
  (random() * 10000)::int as unitsales
from generate_series(1, 1000) as generator;

Although I'm confused whether you want a timestamptz (you definitely don't want timestamp, without timezone, as this type is evil) or a date in logdate.

answered Oct 25, 2019 at 20:28

Tometzky

24.2k5 gold badges64 silver badges79 bronze badges

1 Comment

MMacphail Over a year ago

Thanks for your answer, I use Timestamp just to test stuff, but I probably wont use it in production code.

Collectives™ on Stack Overflow

How can I prevent my SQL subquery from caching data in PostgreSQL?

2 Answers 2

1 Comment

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related