I would like to write an R function that adds interaction terms to a formula.

For instance, the function takes the formula mpg ~ cyl + gear + disp, the treatment variable cyl and a character vector of control variables c("gear","disp") and returns mpg ~ cyl + cyl * gear + cyl * disp.

Ideally, the function should return an error if one of the control variables is not in the formula, or if the interaction term is already in the formula.

I came up with the following, which seems to work but uses string manipulation rather than first principles.

I think this makes it more prone to errors and slower.

How can I re-write it to use first principles?

#' Add interaction terms in a formula
#' 
#' @param form A formula
#' @param treat The treatment variable (string)
#' @param controls A character vector of control variables
#' @return A formula with interaction terms added between `treat` and each variable in `controls`
#' @export
#' @examples
#' reformulas_addints(mpg ~ cyl + gear, "cyl", c("gear"))
#' reformulas_addints(mpg ~ cyl + gear + disp, "cyl", c("gear", "disp"))
#' reformulas_addints(mpg ~ cyl + gear, "cyl", c("gears"))
#' reformulas_addints(mpg ~ cyl + cyl*gear, "cyl", c("gear"))
reformulas_addints <- function(form, treat, controls) {
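  # deparse1() yields a single string; as.character() would split the formula
  # into three strings ("~", response, right-hand side) and break the if() tests
  form_str <- deparse1(form)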
  for (control in controls) {
    if(!stringr::str_detect(form_str, control)){
      stop(paste0("The variable '", control, "' is not in the formula."))
    }
    patt <- paste0(r"(\s*)",treat,r"(\s*\*\s*)",control, r"(\s*)")
    if(stringr::str_detect(form_str, patt)){
      stop(paste0("The interaction '",treat, " * ", control, "' is already in the formula."))
    }
    form_str <- stringr::str_replace(
      form_str,
      paste0("\\b", control, "\\b"),
      paste0(treat, " * ", control)
    )
  }
  return(as.formula(form_str))
}

Here are some examples with expected output:

# Expected outputs 
reformulas_addints(mpg ~ cyl + gear, "cyl", c("gear"))
# mpg ~ cyl + cyl * gear
# also acceptable
# mpg ~ cyl + gear + cyl:gear
reformulas_addints(mpg ~ cyl + gear + disp, "cyl", c("gear", "disp"))
# mpg ~ cyl + cyl * gear + cyl * disp
# also acceptable
# mpg ~ cyl + gear + disp + cyl:gear + cyl:disp
reformulas_addints(mpg ~ cyl + gear + disp + hp, "cyl", c("gear", "disp"))
# mpg ~ cyl + cyl * gear + cyl * disp + hp
# also acceptable
# mpg ~ cyl + gear + disp + cyl:gear + cyl:disp + hp
# Notice that `hp` is _not_ interacted
reformulas_addints(mpg ~ cyl + gear, "cyl", c("gears"))
# Error: The variable 'gears' is not in the formula.
reformulas_addints(mpg ~ cyl + cyl*gear, "cyl", c("gear"))
# Error: The interaction 'cyl * gear' is already in the formula.
  • Frame challenge: why not simply construct the formula from scratch, using a flag to indicate whether the interaction is required? Also, don't forget that you can include an interaction with : as well as with *. Commented Aug 29 at 12:17
  • @Limey I'm writing a paper and first estimate a model without interactions, then I call update on that model with the new formula that contains the interactions. This way, when I change the main formula, the formula with the interactions changes too. I guess I could write a function that generates the formula, but wouldn't that be overkill? Commented Aug 29 at 12:48
  • If you're happy to provide your column names as character strings, I don't think you'll get a better solution than @user2554330's. If your workflow sits in the tidyverse, a solution using NSE would be fitting. But I think that would be difficult... Commented Aug 29 at 14:28
  • @Limey what is NSE? Commented Aug 29 at 18:44
  • NSE is Non Standard Evaluation. It’s the thing that lets you pass bare, unquoted, column names to tidyverse functions without getting an “object doesn’t exist” error. @g-grothendieck’s (now revised) original attempt at a solution gave a very elegant example of its use, though unfortunately it didn’t meet your (as then not clearly stated) requirements. Commented Aug 30 at 5:51
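
To illustrate the NSE idea from the last comment, here is a minimal base R sketch. The function name add_interaction_nse is hypothetical and does not come from the thread; it captures bare variable names with substitute() and, for brevity, performs none of the checks requested in the question.

```r
# Hypothetical helper: accepts bare (unquoted) variable names.
# substitute() captures the unevaluated arguments, and deparse() turns
# them into strings, so `cyl` is never looked up as an object.
add_interaction_nse <- function(form, treat, control) {
  treat <- deparse(substitute(treat))
  control <- deparse(substitute(control))
  # Append the treat:control interaction to the existing right-hand side
  update(form, as.formula(paste("~ . +", treat, ":", control)))
}

add_interaction_nse(mpg ~ cyl + gear, cyl, gear)
#> mpg ~ cyl + gear + cyl:gear
```

Because the names are captured rather than evaluated, this style is convenient interactively but harder to call from other functions, which is one reason a character-vector interface may still be preferable here.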

2 Answers

Use terms() to find out the items in your formula, and update() to modify it. I don't really know how low-level you want to go with "first principles", but this seems to do what you want:

reformulas_addints <- function(form, treat, controls) {
  stopifnot (inherits(form, "formula"))
  terms <- terms(form)
  
  variables <- attr(terms, "term.labels")
  
  stopifnot(controls %in% variables, treat %in% variables)
  
  for (control in controls) {
    new <- as.formula(paste("~ . + ", control, ":", treat))
    form <- update(form, new)
  }
  
  form
}

Here is an example:

f <- mpg ~ cyl + gear + disp

reformulas_addints(f, "cyl", c("gear", "disp"))
#> mpg ~ cyl + gear + disp + cyl:gear + cyl:disp

Created on 2025-08-29 with reprex v2.1.1

3 Comments

This is acceptable, although in the last of my examples it doesn't raise an error but still returns a valid formula. Thanks!
Is it possible to update this so that it raises an error in the last example?
Sure. Start with a null update on the input formula to put it in standard form, then after each update in the loop compare the updated formula to the previous one. If they are the same, then you didn't make any change, i.e. the new interaction was already there.
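
The approach described in the last comment might be sketched as follows. This is one possible reading of it, not code from the answer: the name reformulas_addints_strict and the error messages are assumptions, and the comparison is done on the term labels rather than on the formula objects themselves so that environments and attributes do not interfere.

```r
# Hypothetical variant of the answer's function that also errors when
# an interaction is already present.
reformulas_addints_strict <- function(form, treat, controls) {
  stopifnot(inherits(form, "formula"))

  # Null update: expands the formula into standard form,
  # e.g. cyl * gear becomes cyl + gear + cyl:gear
  form <- update(form, . ~ .)

  variables <- attr(terms(form), "term.labels")
  missing_vars <- setdiff(c(treat, controls), variables)
  if (length(missing_vars) > 0) {
    stop("Not in the formula: ", toString(missing_vars))
  }

  for (control in controls) {
    new <- as.formula(paste("~ . +", treat, ":", control))
    updated <- update(form, new)
    # If the term labels did not change, the interaction was already there
    if (identical(labels(terms(updated)), labels(terms(form)))) {
      stop("The interaction '", treat, " * ", control,
           "' is already in the formula.")
    }
    form <- updated
  }

  form
}

reformulas_addints_strict(mpg ~ cyl + gear + disp, "cyl", c("gear", "disp"))
#> mpg ~ cyl + gear + disp + cyl:gear + cyl:disp
```

With this sketch, calling the function on mpg ~ cyl + cyl*gear with treatment "cyl" and control "gear" should stop with the "already in the formula" message instead of returning a formula.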
Restating the problem: we have a formula consisting of a response variable, a treatment variable, control variables and other variables, and we want that formula (or an equivalent one) with the interactions between the treatment and the control variables added.

We create an expression from the control variables using reformulate and take its second component, which is the right-hand side as an expression. We then form a list L with that as the first component and treat, converted to a name, as the second. Next we create the template formula fo2, use L to substitute in the control and treatment variables, and convert the result to formula class, giving fo3. Finally, we expand the terms and ensure that the result, fo4, has the same environment as the input formula fo.
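
The intermediate objects in these steps can be inspected in isolation. The snippet below is only an illustrative walk-through with made-up variable names (rhs, template, fo); it uses an unevaluated template built with quote() instead of update(), which is an assumption made to keep the example small.

```r
control <- c("gear", "disp")
treat <- "cyl"

# reformulate() builds the one-sided formula ~gear + disp;
# its second component is the right-hand side as a call
rhs <- reformulate(control)[[2]]

# Values to substitute for the placeholder symbols
L <- list(control = rhs, treat = as.name(treat))

# Replace the placeholders in an unevaluated template, then evaluate
# the resulting call to obtain a formula object
template <- quote(mpg ~ treat * control)
fo <- eval(do.call("substitute", list(template, L)))

# terms() expands the product into main effects and interactions
labels(terms(fo))
#> [1] "cyl"      "gear"     "disp"     "cyl:gear" "cyl:disp"
```

No strings are pasted together to build the model terms here: the control names enter via reformulate() and the treatment via as.name(), after which everything is manipulation of language objects.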

This outputs formula objects which are equivalent to those for the examples in the question, is short (except for error processing) and uses only formula and expression manipulation.

get_interactions <- function(fo) Filter(\(x) grepl(":", x), labels(terms(fo)))

add_interactions <- function(fo, treat, control) {

  # Report controls that do not appear on the right-hand side
  notfound <- setdiff(control, all.vars(fo)[-1])
  if (length(notfound))  stop("Missing controls in fo: ", toString(notfound))

  L <- list(control = reformulate(control)[[2]], treat = as.name(treat))
  fo2 <- update(fo, . ~ . + treat * control)
  fo3 <- formula(do.call("substitute", list(fo2, L)))
  fo4 <- reformulate(labels(terms(fo3)), all.vars(fo3)[1], env = environment(fo))

  # Check for pre-existing interactions only once fo4 exists; running this
  # in on.exit() fails with "object 'fo4' not found" if the function stops early
  both <- intersect(get_interactions(fo4), get_interactions(fo))
  if (length(both)) stop("Interactions already in fo: ", toString(both))

  fo4
}

add_interactions(mpg ~ cyl + gear, "cyl", "gear")
## mpg ~ cyl + gear + cyl:gear
 
add_interactions(mpg ~ cyl + gear + disp, "cyl", c("gear", "disp"))
## mpg ~ cyl + gear + disp + cyl:gear + cyl:disp
 
add_interactions(mpg ~ cyl + gear + disp + hp, "cyl", c("gear", "disp"))
## mpg ~ cyl + gear + disp + hp + cyl:gear + cyl:disp

add_interactions(mpg ~ cyl + cyl * gear, "cyl", c("gear"))
## Error in add_interactions(mpg ~ cyl + cyl * gear, "cyl", c("gear")) : 
##  Interactions already in fo: cyl:gear

8 Comments

No, the function must take the treatment variable and the control variable. This is the specification. You can't interact the first variable with all the others.
I see what OP means, but some expected output in the original question would have helped make it clearer. This is a very neat idea, though...
I have added a section with expected output
Have stated what I now assume is the question and provided revised answer.
no, control variables are not "all variables other than response and treatment". We have response variable, treatment variable, control variables we want to interact with treatment, and control variables we do not want to interact with treatment.
In any case this still uses string manipulation
Now that there are examples in the question it is clearer so I have revised the answer again.
Also added error processing.
