
I want to convert a MySQL query from a Python script to an analogous query in R. The Python script uses a loop to search for specific values by genomic coordinates:

SQL = """SELECT value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites)
                        WHERE `chrom` = %d AND `site` = %d""" % (Table, Chr, Start)
cur.execute(SQL)

In R, the chromosomes and sites are in a dataframe. For every row in the dataframe I would like to extract a single value and add it to a new column in the dataframe.

So my current dataframe has a similar structure to the following:

df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))

The amended dataframe should have an additional column with values from the database at the corresponding genomic coordinates. The structure should be similar to:

df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300), "Value"=c(1.5, 0, 5, 60, 100))

So far I connected to the database using:

con <- dbConnect(MySQL(),
                 user="root", password="",
                 dbname="MyDataBase")

Rather than loop over each row in my dataframe, I would like to use something that would add the corresponding value to a new column in the existing dataframe.

Update with working solution based on answer below:

library(RMySQL)
library(plyr)  # for ldply
con <- dbConnect(MySQL(),
                 user="root", password="",
                 dbname="MyDataBase")

GetValue <- function(DataFrame, Table){
  queries <- sprintf("SELECT value as value 
                     FROM %s FORCE INDEX (chrs) FORCE INDEX (sites) 
                     WHERE chrom = %d AND site = %d UNION ALL SELECT 'NA' LIMIT 1", Table, DataFrame$Chr, DataFrame$Site)
  res <- ldply(queries, function(query) { dbGetQuery(con, query)})
  DataFrame[, Table] <- res$value
  return(DataFrame)
}
df <- GetValue(df, "TableName")
  • Can you explain in words what the result is supposed to look like? Commented Jun 18, 2014 at 14:07

4 Answers


Maybe you could do something like this. First build up your queries, then execute them, storing the results in a column of your dataframe. Not sure if the do.call(rbind, ...) part is necessary, but it basically takes a bunch of one-row data frames and binds them together, row by row, into a single data frame.

queries=sprintf("SELECT value as value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites) WHERE chrom = %d AND site = %d UNION ALL SELECT 0 LIMIT 1", "TableName", df$Chrom, df$Pos)
df$Value = do.call("rbind", sapply(queries, function(query) dbGetQuery(mydb, query), simplify = FALSE))$value

I played with your SQL a little, my concern with the original is with cases where it might return more than 1 row.


8 Comments

I was able to get it somewhat working by using: queries <- sprintf("SELECT coalesce(value,0) as value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites) WHERE chrom = %d AND site = %d LIMIT 1", "TableName", df$Chrom, df$Pos) and then dbGetQuery(con, queries). However, this only gives the first result. If I do the second part of your code df$Value <- do.call("rbind",dbGetQuery(con, queries))$value I get an error "Error in do.call("rbind", dbGetQuery(con, queries))$value : $ operator is invalid for atomic vectors"
If I use dbSendQuery as you suggest the error I get is: "Error in do.call("rbind", dbSendQuery(con, queries)) : second argument must be a list"
Updated my response with an sapply.
That works! One more thing though that I forgot to add to my original question. Some searches won't be found in my database so I have several queries which will have no database entries. Is there a way to fill in "NA" for those entries?
That is what the coalesce(value,0) part of the query should be doing. Basically it says: return value, or 0 if it is NULL. You can replace the 0 with whatever you want. w3resource.com/mysql/comparision-functions-and-operators/…

I like the data.table package for this kind of task, as its syntax is inspired by SQL.

require(data.table)

First, an example table standing in for the database, to match values against:

table <- data.table(chrom=rep(1:5, each=5), 
                    site=rep(100*1:5, times=5), 
                    Value=runif(5*5))

Now the SQL query can be translated into something like

# select from table, where chrom=Chr and site=Site, value
Chr <- 2
Site <- 200
table[chrom==Chr & site==Site, Value] # returns data.table
table[chrom==Chr & site==Site, ]$Value # returns numeric

Key (index) the table for quick lookup (assuming the chrom and site combinations are unique):

setkey(table, chrom, site)
table[J(Chr, Site), ]$Value # very fast lookup due to indexed table

Convert your dataframe to a data.table, keyed on the two columns 'Chr' and 'Site':

df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))
dt <- as.data.table(df) # adds data.table class to data.frame
setkey(dt, Chr, Site) # index for 'by' and for 'J' join

Match the values and append them in a new column (by reference, so the table is not copied):

# loop over keys Chr and Site and find the match in the table
# select the Value column and create a new column that contains this
dt[, Value:=table[chrom==Chr & site==Site]$Value, by=list(Chr, Site)]
# faster:
dt[, Value:=table[J(Chr, Site)]$Value, by=list(Chr, Site)]
# fastest: in one table merge operation assuming the keys are in the same order
table[J(dt)]

kind greetings

2 Comments

This looks great. However I'm trying to access an already existing MySQL database with many tables I will have to access multiple times. I'm not sure how this will increase the efficiency...
I believe that if the SQL DB is sorted by your indexes chrom and site, the keying is not so painful. Without keying you must use the double-condition selection a==A & b==B, which is slow. If possible, you could keep the tables in memory to avoid repeating the indexing step.

Why don't you use the RMySQL or sqldf package?

With RMySQL, you get MySQL access in R.

With sqldf, you can issue SQL queries on R data structures.
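For instance, here is a minimal sqldf sketch (the lookup data frame is invented here to stand in for the MySQL table):

```r
# sqldf runs SQL against in-memory data frames (assumes sqldf is installed)
library(sqldf)

df <- data.frame(Chr = c(1, 1, 3, 5, 5), Site = c(100, 200, 400, 100, 300))

# invented stand-in for the database table
lookup <- data.frame(Chr = c(1, 1, 3, 5, 5),
                     Site = c(100, 200, 400, 100, 300),
                     value = c(1.5, 0, 5, 60, 100))

# LEFT JOIN keeps every row of df; coordinates missing from lookup come back as NA
res <- sqldf("SELECT d.Chr, d.Site, l.value
              FROM df d LEFT JOIN lookup l
              ON d.Chr = l.Chr AND d.Site = l.Site")
```

The same statement would work unchanged against the real table once pointed at the MySQL connection.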

Using either of those, you do not need to reword your SQL query to get the same results.

Let me also mention the data.table package, which lets you do very efficient selects and joins on your data frames after converting them to data tables using as.data.table(your.data.frame). Another good thing about it is that a data.table object is also a data.frame, so all your functions that work on data frames work on these converted objects too.

Comments


You could easily use the dplyr package. There is even a nice vignette about that: http://cran.rstudio.com/web/packages/dplyr/vignettes/databases.html.

One thing you need to know is:

You can connect to MySQL and MariaDB (a recent fork of MySQL) through src_mysql(), mediated by the RMySQL package. Like PostgreSQL, you'll need to provide a dbname, username, password, host, and port.
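Once connected, dplyr's verbs work the same way on remote tables and on local data frames, so the lookup in the question can be sketched without a live database (the lookup table below is invented):

```r
library(dplyr)

df <- data.frame(Chr = c(1, 1, 3, 5, 5), Site = c(100, 200, 400, 100, 300))

# invented stand-in for the database table; only some coordinates are present
lookup <- data.frame(Chr = c(1, 3, 5),
                     Site = c(100, 400, 300),
                     Value = c(1.5, 5, 100))

# left_join keeps all rows of df; coordinates absent from lookup get NA
df2 <- left_join(df, lookup, by = c("Chr", "Site"))
```

Against a real connection you would replace lookup with tbl(con, "TableName") and call collect() at the end to pull the result into R.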

Comments
