I have two large data frames that I would like to outer join with merge(), but the joined table is too large for RAM. My workaround is to use the RSQLite package to do the outer join and store the joined table back in the database.
I would like to apply an R function to the columns of this joined table, but I can't figure out how to append a column to it. I know how to do it with dbWriteTable() (shown below), but that requires reading the whole joined table into R first, which is not an option because the table is larger than RAM.
library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
con <- dbConnect(dbDriver("SQLite"), dbname = tempfile())
dbWriteTable(con, "left_table", left, row.names = F)
dbWriteTable(con, "right_table", right, row.names = F)
dbGetQuery(con, "CREATE TABLE merged_table (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(con, "INSERT INTO merged_table SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")
fun <- function(x) rowSums(x)
temp <- dbReadTable(con, "merged_table")
dbWriteTable(con, "merged_table_new", cbind(temp, fun(temp[, 2:3])))
dbDisconnect(con)
I have heard that databases work on rows, so I suspect the correct solution may just cycle through the rows, appending an entry to each row, but I'm not sure how to implement it. Thanks!
(And there's nothing sacred about SQLite here, I just thought that it would be better for this ad hoc analysis.)
Edit: I learned about the bind.data option in dbGetPreparedQuery() and realized that I need a read connection and a write connection to the database, but I am still having problems: the computed values are not inserted into the existing rows. The script runs without error, but also without the desired result.
library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
my.tempfile <- tempfile()
con.write <- dbConnect(dbDriver("SQLite"), dbname = my.tempfile)
con.read <- dbConnect(dbDriver("SQLite"), dbname = my.tempfile)
dbWriteTable(con.write, "left_table", left, row.names = F)
dbWriteTable(con.write, "right_table", right, row.names = F)
dbGetQuery(con.write, "CREATE TABLE merged_table (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(con.write, "INSERT INTO merged_table SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")
dbGetQuery(con.write, "ALTER TABLE merged_table ADD COLUMN sum INTEGER")
dbGetQuery(con.write, "ALTER TABLE merged_table ADD COLUMN mean INTEGER")
res <- dbSendQuery(con.read, "SELECT left_num, right_num FROM merged_table")
while (!dbHasCompleted(res)) {
  data.1 <- fetch(res)
  data.2 <- data.frame(rowSums(data.1), rowMeans(data.1))
  dbGetPreparedQuery(con.write, "INSERT INTO merged_table (sum, mean) VALUES (?, ?)", bind.data = data.2)
}
dbClearResult(res)
dbGetQuery(con.read, "SELECT * FROM merged_table LIMIT 5")
gives
  letters left_num right_num sum mean
1       a        1        21  NA   NA
2       a        1        22  NA   NA
3       a        1        23  NA   NA
4       a        1        24  NA   NA
5       a        1        25  NA   NA
but I expected
  left_num right_num sum mean
1        1        21  22 11.0
2        1        22  23 11.5
3        1        23  24 12.0
4        1        24  25 12.5
5        1        25  26 13.0
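
One direction that might get the expected result, offered only as a sketch: INSERT always appends new rows, so the values computed in R never land in the rows that already exist. An UPDATE keyed on SQLite's built-in rowid fills the existing rows instead. The version below assumes a recent RSQLite in which dbExecute() and dbGetQuery() accept a params argument (it does not use dbGetPreparedQuery()), reads the table in chunks over a single connection, and uses the made-up names chunk_size and last_id. I have not tested it on a table larger than RAM.

```r
library(DBI)
library(RSQLite)

left  <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)

con <- dbConnect(RSQLite::SQLite(), dbname = tempfile())
dbWriteTable(con, "left_table", left, row.names = FALSE)
dbWriteTable(con, "right_table", right, row.names = FALSE)

# Create the merged table with the extra columns up front (mean stored as REAL).
dbExecute(con, "CREATE TABLE merged_table
                (letters TEXT, left_num INTEGER, right_num INTEGER,
                 sum INTEGER, mean REAL)")
dbExecute(con, "INSERT INTO merged_table (letters, left_num, right_num)
                SELECT l.let, l.num, r.num
                FROM left_table AS l
                LEFT OUTER JOIN right_table AS r ON l.let = r.let")

chunk_size <- 30L  # hypothetical; choose whatever fits comfortably in RAM
last_id    <- 0L   # highest rowid processed so far

repeat {
  # Read the next chunk, ordered by rowid so no row is skipped or repeated.
  chunk <- dbGetQuery(con,
    "SELECT rowid AS id, left_num, right_num
     FROM merged_table WHERE rowid > ? ORDER BY rowid LIMIT ?",
    params = list(last_id, chunk_size))
  if (nrow(chunk) == 0) break

  vals <- chunk[, c("left_num", "right_num")]

  # UPDATE the existing rows; the statement runs once per bound row.
  dbBegin(con)
  dbExecute(con,
    "UPDATE merged_table SET sum = ?, mean = ? WHERE rowid = ?",
    params = list(unname(rowSums(vals)), unname(rowMeans(vals)), chunk$id))
  dbCommit(con)

  last_id <- max(chunk$id)
}

# Read everything back only to check this small example.
out <- dbGetQuery(con, "SELECT * FROM merged_table")
dbDisconnect(con)
```

Each SELECT finishes before the UPDATE starts, so no result set stays open while writing and a second connection should not be needed. For something as simple as a sum or mean the arithmetic could presumably be done in SQL directly, but the loop is meant to stand in for an arbitrary R function.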