
I have a table such as:

employee <- c('John Doe','Peter Gynn','Jolie Hope') 
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))

employ.data <- data.frame(employee, salary, startdate)
employ.data$Salary.category <- NA
employ.data$Salary.next.year <- 0

I am looking to add more columns. The value of each column will be a function of the salary.

I have created the following loop:

for (employee in 1:nrow(employ.data)){
  if (is.na(employ.data[employee,2]) == FALSE){
    if(employ.data[employee,2] <=  22000){
      employ.data[employee,4] = "Sub 22k"
      employ.data[employee,5] = employ.data[employee,2] * 1.20
    } else if(employ.data[employee,2] >  22000 && employ.data[employee,2] <=  23000){
      employ.data[employee,4] = "Sub 23k"
      employ.data[employee,5] = employ.data[employee,2] * 1.10
    } else if(employ.data[employee,2] >  23000){
      employ.data[employee,4] = "Sub 24k"
      employ.data[employee,5] = employ.data[employee,2] * 1.10
    }
  }
}

It works well; the resulting data frame is:

> employ.data
    employee salary  startdate Salary.category Salary.next.year
1   John Doe  21000 2010-11-01         Sub 22k            25200
2 Peter Gynn  23400 2008-03-25         Sub 24k            25740
3 Jolie Hope  26800 2007-03-14         Sub 24k            29480

The issue is that in the actual table I have about 5 columns to add over 1 million rows, and this takes about 2 hours to run. Is there a better way to build the additional columns, for example with the apply family of functions?

1 Answer

This should be much faster. When I test on your dataset replicated to 150,000 rows, your loop takes 431 seconds on my old laptop, while the code below takes 0.35 seconds, more than 1200x faster. Running the same code through the high-performance collapse library makes it about another 3x faster, roughly 4,000x as fast overall. That should get your runtime down to a couple of seconds.

library(dplyr)
# options(collapse_mask = "manip")  # optional, to use faster "collapse"
# library(collapse)                 # library to run code below
employ.data %>%
  mutate(Salary.category = case_when(salary <= 22000 ~ "Sub 22k",
                                     salary <= 23000 ~ "Sub 23k",
                                     TRUE ~ "Sub 24k"),
         Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                                salary <= 23000 ~ 1.1, # same multiplier as below, so redundant
                                               TRUE ~ 1.1))
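
Note that case_when() evaluates its conditions top to bottom and stops at the first match, which is why the later conditions don't need explicit lower bounds. Since the 23k and over-23k branches share the 1.1 multiplier, the second case_when() could be collapsed to two branches. Also, to keep the result, assign the pipeline back, e.g. employ.data <- employ.data %>% mutate(...).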

Larger data set:

employ.data %>% tidyr::uncount(50000) -> employ.data   # 150,000 rows instead of 3
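
If you want to reproduce the timing comparison on your own machine, here is a minimal sketch. slow_loop() is a hypothetical wrapper I've put around your original for-loop purely for benchmarking; it is not part of the original code:

slow_loop <- function(df) {
  # row-by-row version of the original loop, returned as a new data frame
  for (i in 1:nrow(df)) {
    if (!is.na(df[i, 2])) {
      if (df[i, 2] <= 22000) {
        df[i, 4] <- "Sub 22k"
        df[i, 5] <- df[i, 2] * 1.20
      } else if (df[i, 2] <= 23000) {
        df[i, 4] <- "Sub 23k"
        df[i, 5] <- df[i, 2] * 1.10
      } else {
        df[i, 4] <- "Sub 24k"
        df[i, 5] <- df[i, 2] * 1.10
      }
    }
  }
  df
}

system.time(slow_loop(employ.data))   # row-by-row loop: seconds to minutes
system.time(
  employ.data %>%
    mutate(Salary.category = case_when(salary <= 22000 ~ "Sub 22k",
                                       salary <= 23000 ~ "Sub 23k",
                                       TRUE ~ "Sub 24k"),
           Salary.next.year = salary * case_when(salary <= 22000 ~ 1.2,
                                                 TRUE ~ 1.1))
)                                     # vectorized: a fraction of a second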

The fundamental slowdown is that your code is not vectorized: you are asking R to redo its interpretation and dispatch overhead on every single row, instead of paying that cost once and applying the calculation to all rows at once. Here's a good explanation of how that works: https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
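
For illustration, here is the same vectorized idea in base R using nested ifelse() calls; this is my own sketch rather than code from the answer above. Each expression operates on the entire salary column in one pass instead of one row at a time:

# Vectorized base R: each ifelse() processes the whole column at once
employ.data$Salary.category <- ifelse(employ.data$salary <= 22000, "Sub 22k",
                               ifelse(employ.data$salary <= 23000, "Sub 23k",
                                                                   "Sub 24k"))
employ.data$Salary.next.year <- employ.data$salary *
  ifelse(employ.data$salary <= 22000, 1.2, 1.1)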
