I apologize if this is a duplicate or a bit confusing - I've searched all around SO but can't seem to find how to apply what I'm trying to accomplish. I haven't used functions/loops extensively, especially ones written from scratch, so I'm not sure whether the error comes from the function (likely) or from the structure of the data. The basic flow is as follows:

Dummy data set - grouping, type, rate, years, months

I'm running lm formula on the data set by grouping with this bit:

coef_models <- test_coef %>% group_by(Grouping) %>% do(model = lm(rate ~ years + months, data = .))

The result of the above gives me intercepts and coefficients for the variables. What I'm trying to accomplish next (and failing at) is this: for every group where a coefficient estimate is negative, drop that term from the equation and rerun lm with just the term whose coefficient is positive. So, for example, for the States grouping, if the years coefficient is negative, I would want to run lm(rate ~ months, data = .) instead.

To get there, with dplyr/broom, I'm taking the results and putting them into a data frame:

#collect the rows with negative coefficients
library(dplyr)
library(broom)
coef_output_test <- as.data.frame(coef_models %>% tidy(model))
coef_output_test$Grouping <- as.character(coef_output_test$Grouping)
#drop these coefficients and rerun
coef_output_test_rerun <- coef_output_test[!(coef_output_test$estimate >= 0),]

From here, I'm trying to rerun the problem groupings without the variable that came out negative in the initial run. Because the dropped variable will vary (in some groups it will be years, in others months), I need to pass through the correct column to use. I think this is where I'm getting hung up:

lm_test_rerun_out <- data.frame(grouping=character()
                            , '(intercept)'=double()
                            , term=character()
                            , estimate=double()
                            , stringsAsFactors=FALSE)    
lm_test_rerun <- function(r) {    
y = coef_output_test_rerun$Grouping
x = coef_output_test_rerun$term
for (i in 2:nrow(coef_output_test_rerun)){
    lm_test_rerun_out <- test_coef %>% group_by(Grouping["y"]) %>% do(model = lm(rate ~ x, data = .))
  }
}
lm_test_rerun(coef_output_test_rerun)

I get this error:

variable lengths differ (found for 'x')

The output for function should be something like this dummy output:

Grouping, Term, (intercept), Estimate
Sports, Years, 0.56, 0.0430
States, Months, 0.67, 0.340

I'm surely not fluent in R, and I'm sure the parts above that do work could be done more efficiently, but the output of the function should be the grouping and x variable used, along with the intercept and estimate for each. Ultimately I'll be taking that output and appending back to the original 'coef_models' - but I can't get past this part for now.

EDIT: sample test_coef set

Grouping   Drilldown   Years  Months  Rate
Sports     Basketball  10     23      0.42
Sports     Soccer      13     18      0.75
Sports     Football    9      5       0.83
Sports     Golf        13     17      0.59
States     CA          13     20      0.85
States     TX          14     9       0.43
States     AK          14     10      0.63
States     AR          10     5       0.60
States     ID          18     2       0.22
Countries  US          8      19      0.89
Countries  CA          9      19      0.86
Countries  UK          2      15      0.64
Countries  MX          21     15      0.19
Countries  AR          8      11      0.62
  • Can you dput(test_coef) and post the results here to make it reproducible? Commented Mar 23, 2018 at 19:27
  • Have you looked into constrained GLMs? stat.washington.edu/handcock/combining/software/glmc.html. This approach seems suspiciously similar to p-hacking or stepwise selection, and I'm not sure the results will be valid inferences. Commented Mar 23, 2018 at 19:28
  • Your parameter for lm_test_rerun is r but you never use it in the function. Maybe you meant lm_test_rerun <- function(coef_output_test_rerun){... instead. Commented Mar 23, 2018 at 19:29
  • Right - yeh the reasoning for dropping and rerunning is kind of a precursor to optimizing best fit models over larger datasets. This is kind of a first run at the ability to handle these negative issues. I'll post a sample dataset for test_coef as well. Commented Mar 23, 2018 at 19:33
  • If they're both negative, I'll drop the group for now. If they're both positive, they will be modeled in the first run through here: coef_models <- test_coef %>% group_by(Grouping) %>% do(model = lm(rate ~ years + months, data = .)) Commented Mar 23, 2018 at 20:24

1 Answer

Consider a base R solution with by, which slices a data frame by one or more factors and runs a function on each grouped subset. Specifically, the code below conditionally reruns the lm model after checking the coefficient matrix and ultimately returns a data frame with the needed values:

Data

txt <- '        Grouping    Drilldown   Years   Months  Rate
    Sports  Basketball  10  23  0.42
    Sports  Soccer  13  18  0.75
    Sports  Football    9   5   0.83
    Sports  Golf    13  17  0.59
    States  CA  13  20  0.85
    States  TX  14  9   0.43
    States  AK  14  10  0.63
    States  AR  10  5   0.60
    States  ID  18  2   0.22
Countries   US  8   19  0.89
Countries   CA  9   19  0.86
Countries   UK  2   15  0.64
Countries   MX  21  15  0.19
Countries   AR  8   11  0.62'

test_coef <- read.table(text=txt, header=TRUE)

Code

df_list <- by(test_coef, test_coef$Grouping, function(df){
  # FIRST MODEL
  res <- summary(lm(Rate ~ Years + Months, data = df))$coefficients

  # CONDITIONALLY DEFINE FORMULA
  f <- NULL
  if (res["Years",1] < 0 && res["Months",1] > 0) f <- Rate ~ Months
  if (res["Years",1] > 0 && res["Months",1] < 0) f <- Rate ~ Years

  # CONDITIONALLY RERUN MODEL
  if (!is.null(f)) res <- summary(lm(f, data = df))$coefficients

  # ITERATE THROUGH LENGTH OF res MATRIX SKIPPING FIRST ROW
  tmp_list <- lapply(seq(length(res[-1,1])), function(i)
    data.frame(Group = as.character(df$Grouping[[1]]), 
               Term = row.names(res)[i+1],
               Intercept = res[1,1],
               Estimate = res[i+1,1])
  )

  # RETURN DATAFRAME OF 1 OR MORE ROWS
  return(do.call(rbind, tmp_list))
})

final_df <- do.call(rbind, unname(df_list))
final_df

#       Group   Term  Intercept    Estimate
# 1 Countries Months -0.0512500  0.04375000
# 2    Sports  Years  0.6894118 -0.00372549
# 3    States Months  0.2754176  0.02941113

Do note: removing the negative coefficient from the first model and rerunning can turn the remaining coefficient negative even though it was positive before. The Sports row above shows this: its Years estimate is -0.0037 after the rerun.
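To make that caveat visible in the result, here is a minimal sketch that generalizes the same idea and flags any estimate that is still negative after the rerun. It assumes the sample data from the question (rebuilt inline without the Drilldown column); refit_positive, checked_df, and the StillNegative column are hypothetical names introduced only for this illustration, and reformulate() from base R's stats package is swapped in for the hard-coded formulas above.

```r
# Sample data from the question, rebuilt inline so the sketch is self-contained
test_coef <- data.frame(
  Grouping = rep(c("Sports", "States", "Countries"), times = c(4, 5, 5)),
  Years  = c(10, 13, 9, 13, 13, 14, 14, 10, 18, 8, 9, 2, 21, 8),
  Months = c(23, 18, 5, 17, 20, 9, 10, 5, 2, 19, 19, 15, 15, 11),
  Rate   = c(0.42, 0.75, 0.83, 0.59, 0.85, 0.43, 0.63, 0.60, 0.22,
             0.89, 0.86, 0.64, 0.19, 0.62)
)

# refit_positive() is a hypothetical helper, not part of any package
refit_positive <- function(df, response = "Rate", terms = c("Years", "Months")) {
  # First model with every candidate term
  fit <- lm(reformulate(terms, response = response), data = df)
  est <- coef(fit)[terms]

  # Terms whose first-pass estimate is strictly positive
  keep <- terms[!is.na(est) & est > 0]

  # Refit only when some, but not all, terms were dropped
  if (length(keep) > 0 && length(keep) < length(terms)) {
    fit <- lm(reformulate(keep, response = response), data = df)
  }

  # One row per remaining term, flagging estimates that are negative
  est <- coef(fit)[-1]
  data.frame(Group = as.character(df$Grouping[1]),
             Term = names(est),
             Intercept = unname(coef(fit)[1]),
             Estimate = unname(est),
             StillNegative = unname(est < 0),
             row.names = NULL)
}

checked_df <- do.call(rbind, lapply(split(test_coef, test_coef$Grouping),
                                    refit_positive))
row.names(checked_df) <- NULL
checked_df
```

Rows where StillNegative is TRUE are groups where the rerun flipped the remaining sign, so they may need to be dropped or handled separately rather than appended back as-is.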

5 Comments

Thank you very much - I do believe this will work and I appreciate you taking the time to run through it. I did have the 'by' method initially and switched over to what I posted. I'm having trouble formatting my data to match what you have in your answer - copying and pasting your script does give me the same results - but when I try it with my own data I just get List,4 in a matrix.
How does your data structure differ from posted? What did you change in this solution?
My original file had spaces in the headers, which I've removed - but even after updating the headers and redoing everything, the only error I keep receiving is 'Error in as.data.frame.default(data) : cannot coerce class "by" to a data.frame'
Ok, I reran everything once more and it works - I'm sure I made an error in my first attempts.
Possibly something in your environment (maybe from a test run) affected the first attempt. Always start from a clean environment when debugging. But glad to help. Happy coding!
