I have been trying, without success, to filter my data frame with dplyr and grep, matching a list of strings across multiple columns. I assumed this would be a simple task, but either nobody has asked this specific question or it is not as easy as I originally thought.

For the following data frame...

foo <- data.frame(var.1 = c('a', 'b', 'c'),
                  var.2 = c('b', 'd', 'e'),
                  var.3 = c('c', 'f', 'g'),
                  var.4 = c('z', 'a', 'b'))

... I would like to filter row-wise to find the rows that contain all three values a, b, and c. The answer I am after would return only row 1, which contains a, b, and c. Rows 2 and 3 should not be returned: each contains two of the three values, but not all three in the same row.

I'm running into issues because grep takes a single vector (one column at a time), when what I really care about is finding strings across many columns in the same row.

I've also used dplyr to filter using %in%, but that returns a row whenever any of the values is present:

foo %>% 
  filter(var.1 %in% c('a', 'b', 'c') |
           var.2 %in% c('a', 'b', 'c') |
           var.3 %in% c('a', 'b', 'c'))

Thanks for any and all help and please, let me know if you need any clarification!

3
  • 2
    foo[apply(foo, 1, function(x) all(c('a', 'b', 'c') %in% x)), ] Commented Jul 10, 2017 at 1:31
  • 1
    apply( foo, 2, function(x) all( grepl(x = x, pattern = "[abc]" ) )) Commented Jul 10, 2017 at 1:36
  • 1
    apply( foo, 2, function(x) sum( grepl(x = x, pattern = "[abc]" ) ) == 3) Commented Jul 10, 2017 at 1:38
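The first comment's one-liner can be unpacked into a small runnable sketch. The helper name has_all and the variable targets are introduced here purely for illustration; they are not from the original comments. Note that MARGIN = 1 makes apply iterate over rows, whereas MARGIN = 2 (as in the other two comments) tests each column instead.

```r
# Example data from the question (strings kept as character, not factors)
foo <- data.frame(var.1 = c("a", "b", "c"),
                  var.2 = c("b", "d", "e"),
                  var.3 = c("c", "f", "g"),
                  var.4 = c("z", "a", "b"),
                  stringsAsFactors = FALSE)

targets <- c("a", "b", "c")

# Hypothetical helper: TRUE when every wanted value occurs somewhere in the row
has_all <- function(row, wanted) all(wanted %in% row)

# MARGIN = 1 applies the helper to each row of foo
keep <- apply(foo, 1, has_all, wanted = targets)

# Keep only the rows that contain all of the targets
result <- foo[keep, ]
print(result)
```

Because the check uses %in% on whole values, it matches exact strings only, not substrings.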

2 Answers

4

Here's an approach in base R: for each of "a", "b", and "c" in turn, check whether each row of foo contains that value, add up the resulting Booleans, and keep the rows where the sum reaches 3. (The code tests >= 3, which is equivalent to == 3 here because the sum cannot exceed the number of values checked.)

Reduce("+", lapply(c("a", "b", "c"), function(x) rowSums(foo == x) > 0)) >=3
#[1]  TRUE FALSE FALSE
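To see what the one-liner above is doing, here is a sketch that spells out the intermediate steps on the question's example data. The names targets, hits, n_found, and keep are illustrative only, and sapply plus rowSums is used in place of Reduce("+", ...) to make the per-value results visible; the logic is otherwise the same.

```r
# Example data from the question (strings kept as character, not factors)
foo <- data.frame(var.1 = c("a", "b", "c"),
                  var.2 = c("b", "d", "e"),
                  var.3 = c("c", "f", "g"),
                  var.4 = c("z", "a", "b"),
                  stringsAsFactors = FALSE)

targets <- c("a", "b", "c")

# One column per target value: does the row contain it anywhere?
hits <- sapply(targets, function(x) rowSums(foo == x) > 0)
print(hits)

# Number of distinct target values found in each row
n_found <- rowSums(hits)

# A row qualifies only when every target value was found
keep <- n_found == length(targets)
print(foo[keep, ])
```

Comparing against length(targets) rather than a hard-coded 3 keeps the check correct if the list of values changes.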

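Timings (foo is redefined below as a 2,000,000 x 5 character matrix; with only 5 columns no row can contain all 20 target letters, so both calls return all FALSE and the comparison measures speed only)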

foo = matrix(sample(letters[1:26], 1e7, replace = TRUE), ncol = 5)
system.time(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20)
#   user  system elapsed 
#   3.26    0.48    3.79 

system.time(apply(foo, 1, function(x) all(letters[1:20] %in% x)))
#   user  system elapsed 
#  18.86    0.00   19.19 


identical(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20, 
          apply(foo, 1, function(x) all(letters[1:20] %in% x)))
#[1] TRUE

2

Your problem arises from trying to apply "tidyverse" solutions to data that isn't tidy. Here's the tidy solution, which uses melt to make your data tidy. See how much tidier this solution is?

> library(dplyr)
> library(reshape2)
> rows = foo %>%
      mutate(id=1:nrow(foo)) %>% 
      melt(id="id") %>% 
      filter(value=="a" | value=="b" | value=="c") %>%
      group_by(id) %>% 
      summarize(N=n_distinct(value)) %>%  # count distinct matches, so a row with "a", "a", "b" does not reach 3
      filter(N==3) %>%
      select(id) %>%
      unlist
Warning message:
attributes are not identical across measure variables; they will be dropped 

That gives you a vector of matching row indexes, which you can then subset your original data frame with:

> foo[rows,]
  var.1 var.2 var.3 var.4
1     a     b     c     z
> 
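For comparison, the same row-wise test can be written in dplyr without reshaping. This is a sketch only, and it assumes dplyr 1.0.0 or later, where rowwise() and c_across() are available; the names targets and matched are illustrative and do not come from the answer above.

```r
library(dplyr)

# Example data from the question (strings kept as character, not factors)
foo <- data.frame(var.1 = c("a", "b", "c"),
                  var.2 = c("b", "d", "e"),
                  var.3 = c("c", "f", "g"),
                  var.4 = c("z", "a", "b"),
                  stringsAsFactors = FALSE)

targets <- c("a", "b", "c")

matched <- foo %>%
  rowwise() %>%                                          # evaluate the filter one row at a time
  filter(all(targets %in% c_across(everything()))) %>%   # every target must appear in the row
  ungroup()                                              # drop the row-wise grouping again

print(matched)
```

Row-wise evaluation is convenient to read but tends to be slow on large data, so the vectorised base R approach may be preferable when speed matters.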

1 Comment

value=="a" | value=="b" | value=="c" could be value %in% c("a","b","c") surely.
