Create variables based on regular expressions with a loop in R

Question

I need help to create variables based on regular expressions.

This is my dataframe:

df <- data.frame(a=c("blue", "red", "yellow", "yellow", "yellow", "yellow", "red"), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato", NA))

Basically, what I want to do is this, but in one step:

regx_1 <- as.numeric(grep("^[a-z]{5}$", df$b))
regx_2 <- as.numeric(grep("^[a-z]{6,}$", df$b))
df$fruit_1 <- NA
df$fruit_1[regx_1 + 1] <- as.character(df$b[regx_1])

df$fruit_2 <- NA
df$fruit_2[regx_2 + 1] <- as.character(df$b[regx_2])

Here is my try:

regex1 <- "^[a-z]{5}$"
regex2 <- "^[a-z]{6,}$"
regex <- c(regex1, regex1)

make_non_matches_NA <- function(vec, pattern){
  df[[newvariable]] <- NA
  df[[newvariable]][as.numeric(grep(pattern, vec)) + 1] <- as.character(vec[as.numeric(grep(pattern, vec))])
  return(newvariable)
}

df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)

EDIT: Why is my approach wrong? (Please note that the actual problem is bigger, so I need an approach that avoids repeating the same block of code for every pattern.)

Any help is much appreciated!

moodymudskipper · Accepted Answer · 2019-11-21 18:12:47Z

2

Having numbered items in your workspace is a good sign that they really belong in a list (or vector), so that they are formally linked and we can work with them much more easily. So let's do that first.

regex <- c("^[a-z]{5}$", "^[a-z]{6,}$")

Our core functionality is to copy a source vector but replace the elements that don't match with NA, so we'll make a function for that, and we'll name it explicitly so we understand intuitively what it's doing (and so will our colleagues, or the next reader on SO ;) ):

make_non_matches_NA <- function(vec, pattern){
  # logical indices of matches
  matches_lgl <- grepl(pattern, vec)
  # the elements which don't match should be NA
  vec[!matches_lgl] <- NA
  # resulting vector should be returned
  vec
}
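A small sketch of why the function uses grepl() rather than grep(): grepl() returns one logical value per element (and FALSE for NA), so it can be negated to select the non-matches directly. The vector x below is made up for illustration and is not part of the question's data.

```r
# Hypothetical example vector (not from the question's data)
x <- c("apple", "orange", NA, "peach")

# grep() returns the positions of the matches
pos <- grep("^[a-z]{5}$", x)
pos
#> [1] 1 4

# grepl() returns one TRUE/FALSE per element; NA counts as a non-match
lgl <- grepl("^[a-z]{5}$", x)
lgl
#> [1]  TRUE FALSE FALSE  TRUE

# negating the logical vector selects everything that should become NA
x[!lgl] <- NA
x
#> [1] "apple" NA      NA      "peach"
```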

Let's test this with the first pattern:

make_non_matches_NA(df$b, regex[[1]])
#> [1] apple <NA>  peach lemon <NA>  <NA> 
#> Levels: apple lemon orange peach pineapple tomato

So far so good! Now let's test it with all the patterns. In R we generally avoid for loops when we can, because we have clearer tools like lapply(). Here I want to apply this function to every regular expression:

lapply(regex, make_non_matches_NA, vec = df$b)
#> [[1]]
#> [1] apple <NA>  peach lemon <NA>  <NA> 
#> Levels: apple lemon orange peach pineapple tomato
#> 
#> [[2]]
#> [1] <NA>      orange    <NA>      <NA>      pineapple tomato   
#> Levels: apple lemon orange peach pineapple tomato

Great, it works!
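To make the argument matching explicit: because vec is supplied by name, lapply() hands each element of regex to the first remaining argument, which is pattern. The sketch below repeats the function so it runs on its own and uses a made-up character vector fruits (an assumption: plain strings rather than factors) to show that the short call is equivalent to an anonymous function.

```r
# Same logic as make_non_matches_NA() above, repeated so the sketch is self-contained
make_non_matches_NA <- function(vec, pattern) {
  vec[!grepl(pattern, vec)] <- NA
  vec
}

regex  <- c("^[a-z]{5}$", "^[a-z]{6,}$")
# Illustrative character vector (assumption: strings, not factors)
fruits <- c("apple", "orange", "peach", "lemon", "pineapple", "tomato")

# vec is named, so each element of regex fills the next free argument: pattern
res_short <- lapply(regex, make_non_matches_NA, vec = fruits)

# the same call spelled out with an anonymous function
res_long <- lapply(regex, function(p) make_non_matches_NA(vec = fruits, pattern = p))

identical(res_short, res_long)
#> [1] TRUE
```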

But I want this in my data.frame, not as a separate list, so I will assign the result directly to the relevant columns of df:

df[c("fruit1", "fruit2")] <- lapply(regex, make_non_matches_NA, vec = df$b)
# then print my updated df
df
#>   a         b fruit1    fruit2
#> 1 1     apple  apple      <NA>
#> 2 2    orange   <NA>    orange
#> 3 3     peach  peach      <NA>
#> 4 4     lemon  lemon      <NA>
#> 5 5 pineapple   <NA> pineapple
#> 6 6    tomato   <NA>    tomato

tada!
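As for why the attempt in the question fails, here is a hedged sketch of the two likely causes: newvariable is never defined inside the function, and assigning to df inside a function only changes a local copy. The function names broken() and local_only() and the two-row df are made up for illustration.

```r
# Small illustrative data frame (not the question's data)
df <- data.frame(b = c("apple", "orange"), stringsAsFactors = FALSE)

# 1) newvariable is neither an argument nor defined anywhere, so R cannot find it
broken <- function(vec, pattern) {
  df[[newvariable]] <- NA
  newvariable
}
err <- tryCatch(broken(df$b, "^[a-z]{5}$"),
                error = function(e) conditionMessage(e))
err  # an error message complaining that 'newvariable' does not exist

# 2) even with a valid column name, the assignment only touches a local copy
local_only <- function(vec, pattern) {
  df[["fruit1"]] <- NA   # modifies the function's own copy of df
  invisible(NULL)
}
local_only(df$b, "^[a-z]{5}$")
names(df)                # the global df is unchanged
#> [1] "b"
```

Returning the new vector from the function and assigning it outside the function, as done above, avoids both problems.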

edited Nov 21, 2019 at 18:12

answered Nov 21, 2019 at 14:47

moodymudskipper


11 Comments

stefan485 Over a year ago

Thank you for your answer, Moody! Your answers are both closely tied to the specific problem, so it is my mistake that I asked the question the wrong way. In my data, one step is more complicated, so I think I need to identify the repeated pattern in the two-step procedure and use it to run these multiple steps (not necessarily with a "for loop"). To be more precise, I more or less need an explanation of why my approach doesn't work.

moodymudskipper Over a year ago

@stefan485 I hope my new answer works better. I tried to detail the steps and, more importantly, the reasoning. R can be frustrating at first, but what helps is to cut the issue into small pieces and make sure you understand each step 100%; it will save you a lot of time in the future.

stefan485 Over a year ago

Oh yeah, I see! The function then seems to work, but the loop over it is still broken, i.e. df <- create.function(df, "b", "fruit_1", regx_1) now works on its own.

stefan485 Over a year ago

Can I ask you once again please why my updated try to a similar problem does not work? :/

moodymudskipper Over a year ago

1) You're right, vec is supplied explicitly, so regex is passed, element by element, to the next argument, which is pattern


Cettt · Accepted Answer · 2019-11-21 14:47:32Z

1

I don't know if this qualifies as "in one step", but you could try mutate from the dplyr package:

df <- data.frame(a=c(1:6), b=c("apple", "orange", "peach", "lemon", "pineapple", "tomato"), 
                 stringsAsFactors = FALSE)

Note that I set stringsAsFactors = FALSE inside data.frame(), so that b is a character column rather than a factor.

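# if_else() is namespaced so the call works without library(dplyr);
# the second pattern uses {6,} to match the question (six or more letters)
dplyr::mutate(df,
              fruit_1 = dplyr::if_else(grepl("^[a-z]{5}$", b), b, NA_character_),
              fruit_2 = dplyr::if_else(grepl("^[a-z]{6,}$", b), b, NA_character_))

  a         b fruit_1   fruit_2
1 1     apple   apple      <NA>
2 2    orange    <NA>    orange
3 3     peach   peach      <NA>
4 4     lemon   lemon      <NA>
5 5 pineapple    <NA> pineapple
6 6    tomato    <NA>    tomato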
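If the number of patterns grows, writing one mutate() argument per pattern becomes repetitive. As a sketch under stated assumptions (the named vector patterns is hypothetical and holds the two patterns from the question, and base R's ifelse() stands in for dplyr::if_else() so the block has no package dependency), the new columns can be generated from the names of a pattern vector:

```r
df <- data.frame(a = 1:6,
                 b = c("apple", "orange", "peach", "lemon", "pineapple", "tomato"),
                 stringsAsFactors = FALSE)

# Hypothetical named vector: the names become the new column names
patterns <- c(fruit_1 = "^[a-z]{5}$", fruit_2 = "^[a-z]{6,}$")

# One new column per pattern; rows that do not match become NA
df[names(patterns)] <- lapply(patterns, function(p) {
  ifelse(grepl(p, df$b), df$b, NA_character_)
})

names(df)
#> [1] "a"       "b"       "fruit_1" "fruit_2"
```

Adding a further pattern then only requires one more named element in patterns.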

answered Nov 21, 2019 at 14:47

Cettt


1 Comment

stefan485 Over a year ago

Thank you, Cettt! Your answer does indeed help me, but I can't adapt it to my code, because it is not flexible enough and I can only make some parts of the code shorter. I have to put my question differently: how can I loop over a repeated block of code where the parts that change are captured in a function or something similar?
