1

I am having a challenge with a piece of code that takes very long to execute and I am wondering what are the key tricks to optimize the execution time of this code. I have to admit that the input data.frame is significant (140,000 rows) and that the output data.frame is approximately 220,000 rows.

A sample of the input data.frame:

head(extremes)
X_BusinessIDDescription     min         max         month
ID105                       2007-12-01  2008-06-01  2007-12-01
ID206                       2007-12-01  2009-07-01  2007-12-01
ID204                       2007-12-01  2008-02-01  2007-12-01
ID785                       2008-07-01  2010-08-01  2008-07-01
ID125                       2007-11-01  2008-07-01  2007-11-01
ID107                       2007-11-01  2011-06-01  2007-11-01

The data.frame that will be extended with the loop. The data.frame is initiated to get the structure in place.

output <- extremes[1,]
output
X_BusinessIDDescription     min         max         month
ID105                       2007-12-01  2008-06-01  2007-12-01

Other values

IDcounter <- 1
IDmax <- nrow(extremes)
linecounter <- 1

The while-loop I would like to optimize:

while (IDcounter <= IDmax){
    start <- extremes$min[IDcounter]
    end <- extremes$max[IDcounter] # add three months
    while(start <= end){
        output[linecounter,] <- extremes[IDcounter,]
        output$month[linecounter] <- start
        linecounter <- linecounter+1
        start <- seq(start, by ="month", length=2)[2]
    }
    IDcounter <- IDcounter + 1
}

For a small number of rows this code executes pretty quickly, but it seems like it is slowing down as the output extends.

The output looks something like this:

head(output)
X_BusinessIDDescription     min         max         month
ID105                       2007-12-01  2008-06-01  2007-12-01
ID105                       2007-12-01  2008-06-01  2008-01-01
ID105                       2007-12-01  2008-06-01  2008-02-01
ID105                       2007-12-01  2008-06-01  2008-03-01
ID105                       2007-12-01  2008-06-01  2008-04-01
ID105                       2007-12-01  2008-06-01  2008-05-01

For every month in the interval between min and max in the extreme file is an row created.

I also would be interested to learn how I can can that this code can take ready of the multiple cores of computing resources available. OK, I admit this is not really an optimization but it will reduce the execution time, which is important as well.

Jochem

5
  • Is there a speed difference between for and while? Commented Nov 20, 2012 at 14:32
  • 1
    No there's no (significant) speed difference between for and while. Commented Nov 20, 2012 at 14:46
  • 2
    You're committing the cardinal sin of growing an object inside a loop. Initialize the entire 220k row data.frame first and insert the results in the relevant rows via subsetting. Commented Nov 20, 2012 at 15:10
  • 1
    What is the problem you are trying to solve? I see a rather gigantic matrix as your output, with hugely redundant entries! If you can tell us what you plan to do with the basic combinations of "IDxxx" and every month that ID value is valid, we can suggest a much simpler structure. E.g. output <- list({name all the unique IDxxx values}) , then each output$IDxxx <-{function which lists all the valid months} . Commented Nov 20, 2012 at 15:13
  • 1
    The plyr package and most importantly the data.table package will probably solve your problem. Commented Nov 20, 2012 at 15:15

1 Answer 1

2

As @CarlWitthoft already mentioned you have to rethink your data structure because of many duplicated data.

Here you find a simple vectorized approach:

  ## create all possible ranges of months
  ranges <- mapply(function(mi, ma) {seq(from=mi, to=ma, by="month")}, mi=extremes$min, ma=extremes$max)

  ## how many months per ID?
  n <- unlist(lapply(ranges, length))

  ## create new data.frame
  output <- data.frame(X_BusinessIDDescription=rep(extremes$X_BusinessIDDescription, n),
                      min=rep(extremes$min, n),
                      max=rep(extremes$max, n),
                      month=as.Date(unlist(ranges), origin="1970-01-01"), stringsAsFactors=FALSE)

Comparison to your approach:

extremes <- data.frame(X_BusinessIDDescription=c("ID105", "ID206", "ID204", "ID785", "ID125", "ID107"),
                      min=as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
                      max=as.Date(c("2008-06-01", "2009-07-01", "2008-02-01", "2010-08-01", "2008-07-01", "2011-06-01")),
                      month=as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
                      stringsAsFactors=FALSE)

approachWhile <- function(extremes) {
  output <- data.frame(X_BusinessIDDescription=NA, min=as.Date("1970-01-01"), max=as.Date("1970-01-01"), month=as.Date("1970-01-01"), stringsAsFactors=FALSE)
  IDcounter <- 1
  IDmax <- nrow(extremes)
  linecounter <- 1
  while (IDcounter <= IDmax){
    start <- extremes$min[IDcounter]
    end <- extremes$max[IDcounter] # add three months
    while(start <= end){
        output[linecounter,] <- extremes[IDcounter,]
        output$month[linecounter] <- start
        linecounter <- linecounter+1
        start <- seq(start, by ="month", length=2)[2]
    }
    IDcounter <- IDcounter + 1
  }
  return(output)
}

approachMapply <- function(extremes) {                       
  ranges <- mapply(function(mi, ma) {seq(from=mi, to=ma, by="month")}, mi=extremes$min, ma=extremes$max)

  n <- unlist(lapply(ranges, length))

  output <- data.frame(X_BusinessIDDescription=rep(extremes$X_BusinessIDDescription, n),
                      min=rep(extremes$min, n),
                      max=rep(extremes$max, n),
                      month=as.Date(unlist(ranges), origin="1970-01-01"), stringsAsFactors=FALSE)
  return(output)
}

identical(approachWhile(extremes), approachMapply(extremes)) ## TRUE

library("rbenchmark")

benchmark(approachWhile(extremes), approachMapply(extremes), order="relative")
#                      test replications elapsed relative user.self sys.self
#2 approachMapply(extremes)          100   0.176     1.00     0.172    0.000
#1  approachWhile(extremes)          100   6.102    34.67     6.077    0.008
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.