I am trying to join 2 data frames based on parts of their dates, but it's not working. The data frames are as follows:

startdf

startperiod
2015-10-01
2016-10-01
2017-10-01
2018-10-01


enddf

endperiod
2016-03-31
2017-03-31
2018-03-31

Both startperiod and endperiod are of the 'Date' data type.

This is the final output I want:

startperiod, endperiod
2015-10-01  2016-03-31
2016-10-01  2017-03-31
2017-10-01  2018-03-31
2018-10-01  Null

The equivalent SQL would be something like this:

Select startperiod, endperiod
From startdf a left join enddf b
On year(b.endperiod) = (year(a.startperiod) + 1)

Is there a way to do this in R? I believe I need to use the sqldf and RH2 libraries, but I couldn't get it working no matter what I tried.

Even this simple query, which should work, doesn't:

sqldf("Select * from startperioddf a where year(startperiod) = 2016")

2 Answers

1) RH2. Assuming:

  • the data is as shown in reproducible form in the Note below (in particular, startperiod and endperiod are assumed to be of Date class),
  • the typos in the question are fixed, and
  • the H2 database backend is used instead of the default SQLite,

then your code works:

library(sqldf)
library(RH2)

sql <- "Select startperiod, endperiod
  From startdf a left join enddf b
  On year(b.endperiod) = (year(a.startperiod) + 1)"
sqldf(sql)

giving:

  startperiod  endperiod
1  2015-10-01 2016-03-31
2  2016-10-01 2017-03-31
3  2017-10-01 2018-03-31
4  2018-10-01       <NA>

Also

sqldf("Select * from startdf a where year(startperiod) = 2016")

giving:

  startperiod
1  2016-10-01

Be sure to read the material on the sqldf github site: https://github.com/ggrothendieck/sqldf

2) sqlite. To use the default SQLite backend, make sure that RH2 is NOT loaded; otherwise sqldf will assume you want to use H2. Also note that SQLite has no date type, so Date class variables are uploaded to SQLite as numbers representing the days since the Unix epoch. The days therefore have to be converted to years, which can be done with strftime as shown:

sql2 <- "Select startperiod, endperiod
  From startdf a left join enddf b
  On strftime('%Y', b.endperiod * 3600 * 24, 'unixepoch') + 0 = 
     strftime('%Y', a.startperiod * 3600 * 24, 'unixepoch') + 1"
sqldf(sql2)

sqldf("Select * from startdf a 
  where strftime('%Y', a.startperiod * 3600 * 24, 'unixepoch') = '2016'")
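The arithmetic behind the two queries above can be checked in plain R, without any database. This is only a sketch to illustrate the conversion: the variable names (d, days, secs, yr) are made up for the example, and as.POSIXct stands in for what strftime with 'unixepoch' does on the SQLite side.

```r
# A Date is stored internally as the number of days since 1970-01-01.
d <- as.Date("2016-10-01")
days <- as.numeric(d)        # days since the Unix epoch
secs <- days * 3600 * 24     # same scaling as in the SQL above

# Interpret the seconds as a UTC timestamp and pull out the year,
# mirroring strftime('%Y', <seconds>, 'unixepoch') in SQLite.
yr <- format(as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"), "%Y")

print(days)
print(yr)
```

Note that yr is a character string, which is why the join condition adds 0 or 1 to force a numeric comparison, while the where clause compares against the string '2016'.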

Note

Lines1 <- "
startperiod
2015-10-01
2016-10-01
2017-10-01
2018-10-01"

Lines2 <- "
endperiod
2016-03-31
2017-03-31
2018-03-31"

startdf <- read.table(text = Lines1, header = TRUE, colClasses = "Date")
enddf <- read.table(text = Lines2, header = TRUE, colClasses = "Date")
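If neither backend is an option (for example when Java cannot be installed), the same year-based matching can probably be done without SQL. The following is a minimal sketch using base R's merge; it is not one of the sqldf approaches above, and the helper column match_year is a hypothetical name introduced only for this example.

```r
# Same data as in the Note, built directly so the block is self-contained.
startdf <- data.frame(startperiod = as.Date(c("2015-10-01", "2016-10-01",
                                               "2017-10-01", "2018-10-01")))
enddf <- data.frame(endperiod = as.Date(c("2016-03-31", "2017-03-31",
                                           "2018-03-31")))

# Hypothetical helper column: the year an end date must fall in to match.
startdf$match_year <- as.integer(format(startdf$startperiod, "%Y")) + 1L
enddf$match_year <- as.integer(format(enddf$endperiod, "%Y"))

# all.x = TRUE keeps start rows without a match, like a SQL left join.
result <- merge(startdf, enddf, by = "match_year", all.x = TRUE)
result <- result[order(result$startperiod), c("startperiod", "endperiod")]

print(result)
```

The unmatched 2018-10-01 row is kept with a missing endperiod, which corresponds to the Null in the desired output.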

7 Comments

@G.Grothendieck The SQLite version worked, thanks for that. The RH2 one didn't; it gave this error: "Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.lang.NoClassDefFoundError: org.h2.jdbc.JdbcConnection. Error in !dbPreExists : invalid argument type". Any idea how I can fix this? I would prefer to use the RH2 version. Also, how do you insert newlines in the original question? I tried but couldn't. Can you please point me to somewhere I can learn how to get the editing right? (Deepak)
Start up a new instance of R and ensure that R and all packages are up to date and that you have Java installed. Then enter the code in the Note, followed by the code in point (1) of the answer, and you will get the output that I pasted into the answer.
Thank you so much. I will try that now and report back!
Just tried in a new instance of RStudio and it comes up with the same error: "Error: package or namespace load failed for ‘rJava’: .onLoad failed in loadNamespace() for 'rJava', details: call: fun(libname, pkgname) error: JAVA_HOME cannot be determined from the Registry". Is this to do with the installation constraints on our work laptops? (Deepak)
I can't reproduce that. I tried it under R using Windows, under RStudio using Windows and under R using Linux and it gave the answer shown in my answer all 3 times. Are you sure you followed the instructions in my prior comment exactly?

The sqldf package in R uses the SQLite database engine by default, so you cannot use the year function in your query to extract the year from a date. The following query will do the job:

sqldf("Select * from startdf where strftime('%Y', startperiod) = '2016'")

It uses SQLite's strftime function to compare specific date parts. The year function is defined in MySQL, so to use it you may have to install the RMySQL package and pass the drv = 'MySQL' argument to tell sqldf which database engine to use.

3 Comments

@Siddharth Will try that. What about joining those 2 data frames as shown in my original query, where endperiod year = startperiod year + 1? How would you do that?
It didn't return any rows: sqldf("Select * from startperioddf where strftime(startperiod, '%Y') = '2016'") gives [1] startperiod newdate <0 rows> (or 0-length row.names)
I tried it both ways, with '%Y' before and after the variable name, and got the same result as above.
