Let's imagine I have a data frame containing two types of information (X# and Y#).

df = data.frame(matrix(rnorm(600), nrow=100))
colnames(df) <- c("X1", "X2", "Y1", "Y2", "Y3", "Y4")

I use two columns (below, X1 and Y1) to group the rows into 9 categories: each column is split into 3 bins, each containing 1/3 of the rows, and the resulting category is stored in a new column cat11. (I apologize for the poor code; I am just a beginner in R.)

library(Hmisc)  # provides cut2()
df$tmpx <- cut2(df$X1, g=3)
levels(df$tmpx) <- c(1,2,3)
df$tmpy <- cut2(df$Y1, g=3)
levels(df$tmpy) <- c(1,2,3)

enum <- 1
for (x in sort(unique(df$tmpx)))
{
  for (y in sort(unique(df$tmpy)))
  {
    print(enum)
    df$cat11[df$tmpx == x & df$tmpy == y] <- enum
    enum <- enum + 1
  }
}

What I am struggling to do now is to run this code for a selection of other combinations (e.g. X1,Y4 > cat14; X2,Y1 > cat21; X2,Y3 > cat23).

I have tried writing a function and using lapply, but without success so far. I think I am missing something obvious.

Any help would be much appreciated.

1 Answer

First I create all combinations of X and Y columns:

combs <- expand.grid(names(df)[grep("X", names(df))],
                     names(df)[grep("Y", names(df))],
                     stringsAsFactors = FALSE)
#  Var1 Var2
#1   X1   Y1
#2   X2   Y1
#3   X1   Y2
#4   X2   Y2
#5   X1   Y3
#6   X2   Y3
#7   X1   Y4
#8   X2   Y4

Then I write a vectorized alternative to your approach and wrap it in a function:

library(Hmisc)
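fun <- function(DF, col1, col2) {
  # Split each column into 3 quantile groups and convert the factor to codes 1..3
  tmpx <- cut2(DF[[col1]], g=3)
  tmpx <- as.integer(tmpx)

  tmpy <- cut2(DF[[col2]], g=3)
  tmpy <- as.integer(tmpy)

  # Map each (tmpx, tmpy) pair to a category 1..9, same numbering as the nested loops
  (tmpx - 1) * 3 + tmpy
}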
}
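To see why the arithmetic on the last line of the function matches the counter in the question's nested loops, here is a small check on the nine possible group pairs. It is only a sketch: the names `grid`, `loop_cat` and `formula_cat` are made up for the illustration, and it works on the integer codes 1 to 3 directly rather than on `cut2` output.

```r
# All nine (x, y) group pairs in the order the nested loops visit them:
# x is the outer loop, so y varies fastest.
grid <- expand.grid(y = 1:3, x = 1:3)

# Category number produced by a running counter, as in the question
loop_cat <- seq_len(nrow(grid))

# Category number produced by the closed-form expression
formula_cat <- (grid$x - 1) * 3 + grid$y

# Both numberings agree for every pair
all(loop_cat == formula_cat)
```

Each step in x skips a block of three categories, and y selects the position inside that block, which is exactly what the loop counter does.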

Note how I use [[ to extract columns given as character strings programmatically. You can't use $ for this (this is a FAQ). Study help("[").
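As a minimal illustration of the difference, assuming a toy data frame `df_demo` and a variable `col` that exist only for this example:

```r
# Toy data frame, used only for this illustration
df_demo <- data.frame(X1 = c(0.1, 0.2), Y1 = c(1, 2))

col <- "X1"

# [[ evaluates `col` and extracts the column named "X1"
by_name <- df_demo[[col]]

# $ does not evaluate `col`; it looks for a column literally named "col",
# which does not exist here, so the result is NULL
by_dollar <- df_demo$col

is.null(by_dollar)
```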

Then I use mapply to apply the function to all combinations:

df[, paste0("cat", 
            gsub("[[:alpha:]]*", "", combs[,1]),
            gsub("[[:alpha:]]*", "", combs[,2]))] <- mapply(fun, combs[,1], combs[,2], 
                                                             MoreArgs = list(DF = df))

mapply walks over its vector arguments in parallel and calls the function once per position, so here the function is applied to X1/Y1, X2/Y1, and so on. Anything in MoreArgs is passed unchanged to every call.
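A small sketch of this behaviour on toy inputs may help. The helper `pair_label` and the input vectors are invented for the illustration and are not part of the solution above.

```r
# Hypothetical helper: joins one element of each vector with a fixed separator
pair_label <- function(a, b, sep) paste(a, b, sep = sep)

# Called once per position: ("X1", "Y1") and then ("X2", "Y4");
# `sep` comes from MoreArgs and is the same in every call
res <- mapply(pair_label, c("X1", "X2"), c("Y1", "Y4"),
              MoreArgs = list(sep = "/"))
unname(res)

# When each call returns a vector of the same length, the results are
# bound together as the columns of a matrix
m <- mapply(function(a, b) c(a, b), 1:3, 4:6)
dim(m)
```

The second call is the situation in the answer: each call to the function returns one value per row, so mapply yields a matrix with one column per combination, which can be assigned to several data frame columns at once.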

The most complicated part is creating the column names. I use a simple regular expression here and just remove all letters from the column names given in combs.
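If only a selection of combinations is wanted, as in the question, one option is to build the table of pairs by hand instead of with expand.grid. The sketch below assumes a hand-written table called `combs_sel` (a name made up here) and only demonstrates the column-name construction, so it runs without Hmisc.

```r
# Hand-picked pairs from the question: X1/Y4, X2/Y1, X2/Y3
combs_sel <- data.frame(Var1 = c("X1", "X2", "X2"),
                        Var2 = c("Y4", "Y1", "Y3"),
                        stringsAsFactors = FALSE)

# Remove the letters from each name and glue the remaining digits onto "cat"
new_names <- paste0("cat",
                    gsub("[[:alpha:]]*", "", combs_sel$Var1),
                    gsub("[[:alpha:]]*", "", combs_sel$Var2))
new_names
```

Passing `combs_sel$Var1` and `combs_sel$Var2` to mapply in place of `combs[,1]` and `combs[,2]`, and assigning to `df[, new_names]`, should then create only those three columns.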

4 Comments

This is a very elegant solution. A quick question: inside fun, would there be any difference in using df[,col1] instead of df[[col1]] to extract columns programmatically?
No, both would work (I think the latter is slightly more efficient).
Thanks Roland, very well done and explained. I still have a question: why did you place "library(Hmisc)" within the function?
@user3541159 You can put it outside the function.
