
I need help to create variables based on regular expressions.

This is my dataframe:

df <- data.frame(a=c("blue", "red", "yellow", "yellow", "yellow", "yellow", "red"), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato", NA))

Basically, what I want to do is this, but in one step:

regx_1 <- as.numeric(grep("^[a-z]{5}$", df$b))
regx_2 <- as.numeric(grep("^[a-z]{6,}$", df$b))
df$fruit_1 <- NA
df$fruit_1[regx_1 + 1] <- as.character(df$b[regx_1])

df$fruit_2 <- NA
df$fruit_2[regx_2 + 1] <- as.character(df$b[regx_2])

Here is my try:

regex1 <- "^[a-z]{5}$"
regex2 <- "^[a-z]{6,}$"
regex <- c(regex1, regex1)

make_non_matches_NA <- function(vec, pattern){
  df[[newvariable]] <- NA
  df[[newvariable]][as.numeric(grep(pattern, vec)) + 1] <- as.character(vec[as.numeric(grep(pattern, vec))])
  return(newvariable)
}

df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)

EDIT: Why is my approach wrong? (Please note that the actual problem is bigger, so I have to stick to an approach that avoids repeating the same pattern of code.)

Any help is much appreciated!

2 Answers


Having numbered items in your workspace is a good sign that they really belong in a list (or, as here, a vector), so they are formally linked and much easier to work with. So let's do that first.

regex <- c("^[a-z]{5}$", "^[a-z]{6,}$")

Our core functionality is to copy a source vector but replace the elements that don't match with NA. We'll write a function for that and give it an explicit name, so we understand intuitively what it does (and so will our colleagues, or the next reader on SO ;) ):

make_non_matches_NA <- function(vec, pattern){
  # logical indices of matches
  matches_lgl <- grepl(pattern, vec)
  # the elements which don't match should be NA
  vec[!matches_lgl] <- NA
  # resulting vector should be returned
  vec
}

Let's test this with the first pattern:

make_non_matches_NA(df$b, regex[[1]])
#> [1] apple <NA>  peach lemon <NA>  <NA> 
#> Levels: apple lemon orange peach pineapple tomato
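The `Levels:` line shows up because `df$b` is a factor here, which was the default for `data.frame()` before R 4.0.0. A minimal sketch, assuming you would rather get plain character strings back, converts the column first. The name `fruits_chr` is made up for illustration and is not part of the original code:

```r
# Same helper as defined above, repeated so this snippet runs on its own
make_non_matches_NA <- function(vec, pattern) {
  vec[!grepl(pattern, vec)] <- NA
  vec
}

# Hypothetical character copy of the column (stands in for as.character(df$b))
fruits_chr <- as.character(
  factor(c("apple", "orange", "peach", "lemon", "pineapple", "tomato"))
)

# Non-matching elements become NA; the result stays a character vector
five_letters <- make_non_matches_NA(fruits_chr, "^[a-z]{5}$")
five_letters
```

The helper itself does not care whether it receives a factor or a character vector; only the printed result differs.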

So far so good! Now let's test it with all the regexes. In R we generally avoid for loops when we can, because we have clearer tools like lapply(). Here I want to apply this function to every regular expression:

lapply(regex, make_non_matches_NA, vec = df$b)
#> [[1]]
#> [1] apple <NA>  peach lemon <NA>  <NA> 
#> Levels: apple lemon orange peach pineapple tomato
#> 
#> [[2]]
#> [1] <NA>      orange    <NA>      <NA>      pineapple tomato   
#> Levels: apple lemon orange peach pineapple tomato

Great, it works!
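To see why this call works: `lapply(X, FUN, ...)` calls `FUN(X[[i]], ...)`, so each element of `X` fills the first argument of `FUN` that is not already supplied. Because `vec` is given by name, each regex lands in `pattern`. A small sketch with a made-up function `show_args()` (purely illustrative) makes the matching visible:

```r
# Hypothetical helper that only reports which value landed in which argument
show_args <- function(vec, pattern) {
  paste0("vec=", vec, "; pattern=", pattern)
}

regex <- c("^[a-z]{5}$", "^[a-z]{6,}$")

# vec is fixed by name, so each element of regex fills the remaining argument
res <- lapply(regex, show_args, vec = "apple")
res[[1]]
res[[2]]
```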

But I want this in my data.frame, not as a separate list, so I will assign the result directly to the relevant column names in my df:

df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)
# then print my updated df
df
#>   a         b fruit1    fruit2
#> 1 1     apple  apple      <NA>
#> 2 2    orange   <NA>    orange
#> 3 3     peach  peach      <NA>
#> 4 4     lemon  lemon      <NA>
#> 5 5 pineapple   <NA> pineapple
#> 6 6    tomato   <NA>    tomato

tada!
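As for why the attempt in the question fails, two things stand out: `newvariable` is never defined inside the function, and assigning to `df` inside a function only changes a local copy. A minimal sketch of the second point, using the made-up names `toy` and `add_column_inside()`:

```r
toy <- data.frame(b = c("apple", "orange"), stringsAsFactors = FALSE)

# Hypothetical function that tries to modify the global data frame from inside
add_column_inside <- function() {
  toy$new_col <- NA   # creates and modifies a local copy of toy
  invisible(NULL)     # the local copy is discarded when the function returns
}

add_column_inside()
names_after_call <- names(toy)
names_after_call      # the global toy is unchanged

# Building the vector and assigning it at the call site works as intended
toy$new_col <- rep(NA_character_, nrow(toy))
names(toy)
```

This is why the function in this answer returns a vector and leaves the assignment into the data frame to the caller.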


11 Comments

Thank you for your answer Moody! Your answers are both strongly tied to the specific problem, so it is my mistake that I asked the question the wrong way. In my data, one step is more complicated, so I think I need to identify the repeated pattern in this two-step procedure and use it to run the steps repeatedly (not necessarily with a "for loop"). To be more precise, I more or less need an explanation of why my approach doesn't work.
@stefan485 I hope my new answer works better. I tried to detail the steps and, more importantly, the reasoning. R can be frustrating at first, but what helps is to cut the issue into small pieces and make sure you 100% understand each step; it will save you a lot of time in the future.
Oh yeah, I see! The function then seems to work, but looping over it is still broken, i.e. df <- create.function(df, "b", "fruit_1", regx_1) now works on its own.
Can I ask you once again please why my updated try to a similar problem does not work? :/
1) You're right: vec is supplied explicitly by name, so regex is passed, element by element, to the next free argument, which is pattern.

I don't know if this qualifies as "in one step", but you could try mutate from the dplyr package:

df <- data.frame(a=c(1:6), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato"), 
                 stringsAsFactors = FALSE)

Note that I set stringsAsFactors = FALSE inside data.frame().

dplyr::mutate(df, fruit_1 = dplyr::if_else(grepl("^[a-z]{5}$", b), b, NA_character_),
              fruit_2 = dplyr::if_else(grepl("^[a-z]{6,}$", b), b, NA_character_))

  a         b fruit_1   fruit_2
1 1     apple   apple      <NA>
2 2    orange    <NA>    orange
3 3     peach   peach      <NA>
4 4     lemon   lemon      <NA>
5 5 pineapple    <NA> pineapple
6 6    tomato    <NA>    tomato
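If the number of patterns grows, the repeated `if_else()` lines can be driven by a named vector instead. A sketch in base R, assuming the column names `fruit_1` and `fruit_2` and the two patterns from the question; the vector name `patterns` is made up for illustration:

```r
df <- data.frame(a = 1:6,
                 b = c("apple", "orange", "peach", "lemon", "pineapple", "tomato"),
                 stringsAsFactors = FALSE)

# Names become the new column names; values are the patterns from the question
patterns <- c(fruit_1 = "^[a-z]{5}$", fruit_2 = "^[a-z]{6,}$")

for (col in names(patterns)) {
  # keep the value where the pattern matches, NA elsewhere
  df[[col]] <- ifelse(grepl(patterns[[col]], df$b), df$b, NA_character_)
}
df
```

Adding another column then only means adding another named entry to `patterns`; the loop body stays the same.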

1 Comment

Thank you Cettt! Your answer does indeed help me, but I can't adapt it to my code, because it is not flexible enough and it only shortens some parts of the code. I have to put my question differently: how can I loop over a repeated pattern of code, where the parts that change are captured in a function or something similar?
