2

I am trying to convert a successful R code using Rpostgresql into a PL/R code in order to avoid pushing/pulling data in and out of the postgreql database.

Code is a dcast on a data.table:

#libs
library(RPostgreSQL);
library(data.table);

# connect
drv <- dbDriver("PostgreSQL");
con <- dbConnect(drv, dbname="postgres", user="postgres");

# load
cli_ranges <- dbGetQuery(con, "SELECT custid, prod_ranges, is_cli from cli_ranges;")

# DT 
setDT(cli_ranges )
setkeyv(cli_ranges , c("prod_ranges"))

# pivot
cli_ranges.pivoted <- dcast(cli_ranges, custid ~ paste0("is_cli_", prod_ranges), fun=sum, value.var = "is_cli")

# send back to DB
dbWriteTable(con, "cli_ranges_pivoted", cli_ranges.pivoted, row.names=F)

Code in R is working fine & fast.

I am now trying to push the code in a PL/R function,

CREATE OR REPLACE FUNCTION public.pivot()
  RETURNS void AS
$BODY$

[copy/paste R code]

$BODY$
  LANGUAGE plr;

... but the last line of the R code (dbWriteTable) throws :

ERROR:  R interpreter expression evaluation error
DETAIL:  Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'dbWriteTable' for signature '"logical", "character", "data.frame"'
CONTEXT:  In PL/R function pivot

Changing the data.table to a dataframe (as.data.frame(cli_ranges.pivoted)) is not working either.

One trick could be to return the data.table/frame in order to perform a CREATE TABLE cli_ranges_pivoted AS SELECT pivot(); but I don't really know how to push the data.frame as an output...

cli_ranges table :

custid  prod_ranges is_cli
1       A           1
1       B           1
1       C           0
2       A           1
2       B           0
2       C           1
3       A           0
3       B           1
3       C           0
4       A           1
...     ...         ...

After dcast (i.e. pivoting) datafram is as follow:

custid  prod_ranges_A   prod_ranges_B   prod_ranges_C
1       1               1               0
2       1               0               1
3       0               1               0
4       1               ...
...

Number of distinct values in prod_ranges changes often, so I can define in advance the number of columns after pivoting.

Env : Postgresql 9.5, R 3.3, PL/R 08.03.00.16, Win 10 64bits

1 Answer 1

2

You could start with http://gpdb.docs.pivotal.io/4330/ref_guide/pl_r.html Example 2 and 3.

Or, You could try modifying the unnest function like Lukas Eklund and Erwind Brandstetter did in this post (I am using Lukas' solution): Unnest array by one level.

CREATE OR REPLACE FUNCTION unnest_multidim(anyarray)
RETURNS SETOF anyarray AS
$BODY$
  SELECT array_agg($1[series2.i][series2.x]) FROM
    (SELECT generate_series(array_lower($1,2),array_upper($1,2)) as x, series1.i
     FROM 
     (SELECT generate_series(array_lower($1,1),array_upper($1,1)) as i) series1 
    ) series2
GROUP BY series2.i
$BODY$
LANGUAGE sql IMMUTABLE;

and then you could try returning an array, and do something like this:

CREATE OR REPLACE FUNCTION r_norm(n integer, mean float8, std_dev float8)
 RETURNS float8[] 
AS $$
 x<-rnorm(n,mean,std_dev); 
 y<-rnorm(n,mean,std_dev);
 final<-cbind(as.data.frame(x), as.data.frame(y));
return(final) 

$$ LANGUAGE 'plr';

CREATE TABLE test_norm_var AS SELECT R_output[1] as col1, R_output[2] as col2 FROM unnest_multidim(r_norm(10,0,1)) R_output;

SELECT col1 FROM test_norm_var;

EDIT

I could not get dbWriteTable to work the way it was intended as a PL/R function... BUT, You could also try this method

CREATE OR REPLACE FUNCTION pivot() 
  RETURNS VOID as $$

library(RPostgreSQL);
library(data.table);

drv <- dbDriver("PostgreSQL");
con <- dbConnect(drv, dbname ="postgres");

fields <- list(custid = "numeric",prod_ranges = "varchar(128)", is_cli = "numeric")

custid <- c(1,1,1,2,2,2)
prod_ranges <- c("A","B","C","A","B","C")
is_cli <- c(1,1,0,1,0,1)

cli_ranges <- data.frame(custid,prod_ranges,is_cli, stringsAsFactors = default.stringsAsFactors())

setDT(cli_ranges )
setkeyv(cli_ranges , c("prod_ranges"))

cli_ranges.pivoted <- as.data.frame(dcast(cli_ranges, custid ~ paste0("is_cli_", prod_ranges), fun=sum, value.var = "is_cli"))

create_query <- paste0("CREATE TABLE cli_ranges (",paste0(colnames(cli_ranges.pivoted), collapse = " numeric, "),
" numeric) DISTRIBUTED BY (",colnames(cli_ranges)[1],")")

dbGetQuery(con, create_query);

values_string <- "("
for ( i in 1:dim(cli_ranges.pivoted)[1]){
 for ( j in 1:dim(cli_ranges.pivoted)[2] ){
    if ( j != dim(cli_ranges.pivoted)[2]) { 
        values_string <- paste0(values_string,cli_ranges.pivoted[i,j],",")
    } else { 
        values_string <- paste0(values_string,cli_ranges.pivoted[i,j])
    }
  } 
    if ( i != dim(cli_ranges.pivoted)[1] ){
        values_string <- paste0(values_string,"),(")
    } else {
        values_string <- paste0(values_string,")")
    }
}

insert_query <- paste0("INSERT INTO cli_ranges (",paste0(colnames(cli_ranges.pivoted), collapse = ", "),
") VALUES ", values_string )

dbGetQuery(con, insert_query);

$$ LANGUAGE plr; 
Sign up to request clarification or add additional context in comments.

5 Comments

The problem is that the number of prod_ranges changes over time, i.e. I cannot define how much columns there will be resulting by the dcast (that transpose my table over the prod_ranges values)
Could you provide an example of your input to dcast or output (fill in fake numbers, etc.) ?
illustration provided in the original question
DISTRIBUTED BY is more for Pivotal/Greenplum/Redshift than postgresql, but thanks I got the idea!
Thanks for all the code, but I was actually looking for a more "direct" solution. I'd better use crosstab then.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.