Create table from loop output Oracle SQL

Question

I need to pull a random sample from a table of ~5 million observations based on 175 demographic options. The demographic table is something like this form:

Basically, I need this same demographic breakdown randomly sampled from the 5M-row table. For each demographic, I need a random sample of that same demographic from the larger table, but with 5x the number of observations (for example, for demographic 1 I want a random sample of 200).

SELECT  *
FROM    (
        SELECT  *
        FROM    my_table
        ORDER BY
                dbms_random.value
        )
WHERE rownum <= 100;

I've used this syntax before to get a random sample, but is there any way I can turn it into a loop and substitute variable names from existing tables? I'll try to capture the logic I need in pseudocode:

for (each demographic_COLUMN in TABLE1) 
    select random(5*num_obs_COLUMN in TABLE1) from ID_COLUMN in TABLE2
/*somehow join the results of each step in the loop into one giant column of IDs */

Do you have one table or two? I find it hard to follow the data structure. What is the "numeric value associated with these percentages"?
– Gordon Linoff, Commented Sep 27, 2018 at 14:55
@GordonLinoff I have two. The first table has the demographic, percentages, and number of observations (the "numeric value"); this table is 175 rows. The second table is what I want to draw a random sample from and is ~5 million rows.
– S420L, Commented Sep 27, 2018 at 14:57
@GordonLinoff I edited my question to try to bring clarity. You don't have to understand it completely; the main thing is that I don't know where to start with looping over columns from different tables and aggregating the results.
– S420L, Commented Sep 27, 2018 at 15:23
Please edit your question and add the create table statement for the table in question. Your sample data only shows two columns (one being the PK?)
– user330315, Commented Sep 27, 2018 at 17:03
I'm still not clear what loops have got to do with the requirement (looping through columns?). And why are you ordering by dbms_random.value instead of using a normal sample clause? Some examples would help a lot.
– William Robertson, Commented Sep 28, 2018 at 9:33

Alex Poole · Accepted Answer · 2018-09-27 15:52:03Z

You could join your tables (assuming the 1-175 demographic value exists in both, or there is an equivalent column to join on), something like:

select id
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage

Each row in the main table is given a random pseudo-row-number within its demographic (via the analytic row_number()). The outer query then uses the relevant percentage to select how many of those randomly-ordered rows for each demographic to return.

I'm not sure I've understood how you're actually picking exactly how many of each you want, so that probably needs to be adjusted.
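
A minimal sketch of that adjustment, assuming the demographics table stores the observation count in a column called num_obs (a hypothetical name; the cut-down three-demographic sample data is also invented for illustration). The filter then compares the random row number against five times that count:

```sql
-- Sample data invented for illustration: 3 demographics, 1000 rows each.
with my_table (id, demographic) as (
  select level, mod(level, 3) + 1 from dual connect by level <= 3000
),
-- num_obs is a hypothetical column name for the per-demographic count.
demographics (demographic, num_obs) as (
            select 1, 40 from dual
  union all select 2, 30 from dual
  union all select 3, 30 from dual
)
select id
from (
  select d.num_obs, t.id,
    -- number each demographic's rows 1..n in random order
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
-- keep the first 5 * num_obs randomly ordered rows per demographic
where rn <= 5 * num_obs;
```

If a demographic has fewer than 5 * num_obs rows in the large table, this returns all of its rows rather than raising an error, so the sample can come up short for sparse demographics.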

Demo with a smaller sample generated in a CTE, and a correspondingly smaller match condition:

-- CTEs for sample data
with my_table (id, demographic) as (
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
            select 1, 40, '4%' from dual
  union all select 2, 30, '3%' from dual
  union all select 3, 30, '3%' from dual
  -- ...
  union all select 174, 2, '.02%' from dual
  union all select 175, 1, '.01%' from dual
)
-- actual query
select demographic, percentage, id, rn
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage;

DEMOGRAPHIC PERCENTAGE         ID         RN
----------- ---------- ---------- ----------
          1         40      94150          1
          1         40      36925          2
          1         40     154000          3
          1         40      82425          4
...
          1         40     154350        199
          1         40     126175        200
          2         30      36051          1
          2         30       1051          2
          2         30     100451          3
...
          2         30      18026        149
          2         30     151726        150
          3         30     125302          1
          3         30     152252          2
          3         30     114452          3
...
          3         30     104652        149
          3         30      70527        150
...
        174          2      35698          1
        174          2      67548          2
        174          2     114798          3
...
        174          2      70698          9
        174          2      30973         10
        175          1     139649          1
        175          1     156974          2
        175          1     145774          3
        175          1      97124          4
        175          1      40074          5

You only need the ID; the other columns are included for context. Or, more succinctly, summarised per demographic:

with my_table (id, demographic) as (
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
            select 1, 40, '4%' from dual
  union all select 2, 30, '3%' from dual
  union all select 3, 30, '3%' from dual
  -- ...
  union all select 174, 2, '.02%' from dual
  union all select 175, 1, '.01%' from dual
)
select demographic, percentage, count(id) as ids, min(id) as min_id, max(id) as max_id
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage
group by demographic, percentage
order by demographic;

DEMOGRAPHIC PERCENTAGE        IDS     MIN_ID     MAX_ID
----------- ---------- ---------- ---------- ----------
          1         40        200        175     174825
          2         30        150          1     174126
          3         30        150       2452     174477
...
        174          2         10      23448     146648
        175          1          5      19074     118649

db<>fiddle
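
Since the title asks about creating a table from the result, one way to persist the sampled IDs is to wrap the same query in create table ... as select. This is only a sketch: sampled_ids is a hypothetical table name, and it assumes the demographics and my_table tables from the first query exist and that you have the privilege to create a table.

```sql
-- sampled_ids is a hypothetical name; pick whatever suits your schema.
create table sampled_ids as
select id
from (
  select d.percentage, t.id,
    -- number each demographic's rows 1..n in random order
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage;

-- sanity check: total number of sampled IDs
select count(*) as ids from sampled_ids;
```

The sample is fixed at the moment the table is created; dropping and recreating it will generally produce a different random set.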

The first block of code worked beautifully just by substituting my column/table names, thanks for understanding what I was trying to ask!
