In R, I would like to create new variables in a data frame by computing them from specific existing variables. The names of the new variables, and the particular existing variables used in each computation, are (or should be) defined by a regular expression.

I know the description is kind of confusing, so here is an example with an imaginary data set where some variables (V1, V2, V3) were measured at 2 different time-points (T1, T2):

dataframe <- data.frame(matrix(rnorm(70), nrow=10))
names(dataframe) <- c("Subject", "V1_T1", "V1_T2", "V2_T1", "V2_T2", "V3_T1", "V3_T2")
dataframe$Subject <- factor(dataframe$Subject)

Now, for each subject and each "Tn" (T1, T2), I would like to generate a new variable (in the same data frame) that is the result of an operation between different variables with the same "Tn". Here is some pseudo-code to explain my needs a bit more clearly (I hope):

for i in c(T1, T2){                                #For each timepoint (& Subject)...
    dataframe$V4_*i* <- V1_*i* + V2_*i* / V3_*i*   #Compute V4 = V1 + V2 / V3
}

This should result in new V4_Tn variables (V4_T1, V4_T2) holding the result of the V1 + V2 / V3 operation for each time-point Tn and each Subject.

In short, I would like to use regular expressions and for-loops to name and compute new variables, looping a predefined operation over variables selected by something like a regular expression. (Using for-loops or regular expressions is not mandatory; if there are alternative methods to achieve what I want, I would like to hear about them.)

I have been toying a bit with the for-loop and regular expression documentation in R, but so far I have not been successful in producing the desired result. I can of course manually write down all required computations in a regular R script, one by one, but that is not efficient at all (the actual data set where I need to apply this is far more complex than this one), and it is pretty annoying to have to copy-paste and edit the same piece of code several times over (it is also more susceptible to typos and errors).

Any help/suggestions would be appreciated, thanks!

2 Answers

Since your example didn't entirely reflect your question, I took the liberty of creating a new dataset which I think respects the spirit of your issue:

Let's assume df

   Subject       V1_T1       V1_T2      V2_T1       V2_T2       V3_T1       V3_T2
1        A  0.16694311  0.47190422  0.6571530  1.68428290  0.60685147  1.25383252
2        B  0.45561405  1.01849804  1.6041593 -1.40256942  1.50029772  1.34857932
3        C  0.31762739 -0.78986513 -0.8054005 -0.14714956 -0.63612792 -0.13565903
4        D  0.66536682 -0.57231682  0.1362731  0.03632215 -0.82147539  0.42349920
5        E  0.09113996  0.73319950  0.1046914 -0.75730274 -0.72833574  0.08412158
6        F  0.01751232 -0.78021331 -0.9158299 -0.68345547 -0.08848462 -0.18618554
7        G -0.96602939  1.08286247  0.6116285  0.08982368  0.12721634  0.71738577
8        H -1.06444232 -0.03971332 -0.5394623 -1.34349634 -0.76919950 -3.01150549
9        I -0.83680136 -0.54609901 -0.1261597 -1.13312110  0.23785615  0.85203224
10       J  1.98656695 -0.01522142  0.7850551  0.93551804 -0.26279470 -0.80375911

For each Subject, create two new columns, V4_T1 and V4_T2, holding the result of (V1 + V2) / V3 for their respective Tn value.


You could restructure your data into a long format using gather(), then separate() the initial column names into two distinct columns, and spread() the result back into a wide format so that mutate() can compute V4 for each Subject & Tn combination. Then we gather() one last time, unite() the columns, and spread() the result back to achieve your desired output:

library(tidyr)
library(dplyr)

df %>%
  gather(key, value, -Subject) %>%
  separate(key, c("V", "T")) %>%
  spread(V, value) %>%
  mutate(V4 = (V1 + V2) / V3) %>%
  gather(key, value, -(Subject:T)) %>%
  unite(R, key, T) %>%
  spread(R, value)

Which gives:

   Subject       V1_T1       V1_T2      V2_T1       V2_T2       V3_T1       V3_T2
1        A  0.16694311  0.47190422  0.6571530  1.68428290  0.60685147  1.25383252
2        B  0.45561405  1.01849804  1.6041593 -1.40256942  1.50029772  1.34857932
3        C  0.31762739 -0.78986513 -0.8054005 -0.14714956 -0.63612792 -0.13565903
4        D  0.66536682 -0.57231682  0.1362731  0.03632215 -0.82147539  0.42349920
5        E  0.09113996  0.73319950  0.1046914 -0.75730274 -0.72833574  0.08412158
6        F  0.01751232 -0.78021331 -0.9158299 -0.68345547 -0.08848462 -0.18618554
7        G -0.96602939  1.08286247  0.6116285  0.08982368  0.12721634  0.71738577
8        H -1.06444232 -0.03971332 -0.5394623 -1.34349634 -0.76919950 -3.01150549
9        I -0.83680136 -0.54609901 -0.1261597 -1.13312110  0.23785615  0.85203224
10       J  1.98656695 -0.01522142  0.7850551  0.93551804 -0.26279470 -0.80375911
         V4_T1      V4_T2
1    1.3579865  1.7196771
2    1.3729097 -0.2847970
3    0.7667846  6.9071309
4   -0.9758538 -1.2656332
5   -0.2688751 -0.2865285
6   10.1522452  7.8613452
7   -2.7858123  1.6346660
8    2.0851608  0.4593084
9   -4.0485020 -1.9708410
10 -10.5467198 -1.1449906
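
As a side note, gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). Assuming tidyr 1.0.0 or later is installed, the same idea can be sketched more compactly as below. The small df and the column name time are made up for illustration and are not the data shown above:

```r
library(tidyr)
library(dplyr)

# Small illustrative data set (arbitrary values, not the data shown above)
df <- data.frame(
  Subject = c("A", "B", "C"),
  V1_T1 = c(1, 2, 3), V1_T2 = c(4, 5, 6),
  V2_T1 = c(1, 1, 1), V2_T2 = c(2, 2, 2),
  V3_T1 = c(2, 3, 4), V3_T2 = c(3, 7, 8)
)

result <- df %>%
  # Long format: one row per Subject and time point, with columns V1, V2, V3.
  # ".value" tells pivot_longer() that the part before "_" names the output column.
  pivot_longer(-Subject, names_to = c(".value", "time"), names_sep = "_") %>%
  mutate(V4 = (V1 + V2) / V3) %>%
  # Back to wide format: V1_T1, V1_T2, ..., V4_T1, V4_T2
  pivot_wider(names_from = time, values_from = V1:V4)

result
```

The ".value" sentinel does the work of the separate()/spread() pair, and pivot_wider() rebuilds the V*_T* names on its own, so the second gather()/unite() round trip is not needed.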

1 Comment

Thanks a lot for the very thorough response! That does it.

Try a data.table solution:

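library(data.table)
setDT(dataframe)  # convert to a data.table by reference

# time points present in the data (T1 and T2 in the example)
tp <- 1:2

# names of the new columns to create
cols <- paste0("V4_T", tp)

# lapply() returns one list element (one new column) per time point
dataframe[, (cols) := lapply(tp, function(x)
  get(paste0("V1_T", x)) + get(paste0("V2_T", x)) / get(paste0("V3_T", x)))]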
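
If you would rather stay in base R and let a regular expression find the time points, as the question suggests, a minimal sketch could look like the following. It assumes every time point has V1_, V2_ and V3_ columns named as in the question; the small data set and the helper names v1_cols and timepoints are only for illustration:

```r
# Small illustrative data set (arbitrary values)
dataframe <- data.frame(
  Subject = factor(c("A", "B", "C")),
  V1_T1 = c(1, 2, 3), V1_T2 = c(4, 5, 6),
  V2_T1 = c(2, 4, 6), V2_T2 = c(3, 6, 9),
  V3_T1 = c(2, 2, 2), V3_T2 = c(3, 3, 3)
)

# Use a regular expression to find the V1_* columns, then strip the
# "V1_" prefix to recover the time-point suffixes ("T1", "T2")
v1_cols <- grep("^V1_T[0-9]+$", names(dataframe), value = TRUE)
timepoints <- sub("^V1_", "", v1_cols)

for (tp in timepoints) {
  v1 <- dataframe[[paste0("V1_", tp)]]
  v2 <- dataframe[[paste0("V2_", tp)]]
  v3 <- dataframe[[paste0("V3_", tp)]]
  # Formula as written in the question: V1 + V2 / V3
  # (division binds tighter than addition; add parentheses for (V1 + V2) / V3)
  dataframe[[paste0("V4_", tp)]] <- v1 + v2 / v3
}

dataframe
```

Because the time points are read from the column names, the loop picks up a T3 or T4 automatically if those columns are added later.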

1 Comment

Thanks as well for your response! It is neat and does the trick. I am only accepting the previous response as an answer because it was first, but this solution works just as well.
