Filter dataframe in R on occurrence of multiple patterns in a string

Question

Data

I have a data frame with a single column consisting of strings in R.

data <- structure(list(col = c("byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d", 
                       "eyr:2023 hcl:#cfa07d ecl:blu hgt:169cm pid:494407412 byr:1936", 
                       "ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956", 
                       "hcl:#18171d ecl:oth pid:266824158 hgt:168cm byr:1992 eyr:2021", 
                       "byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

Problem

I want to keep only the rows of this data frame that contain all of the following patterns/fields:

fields <- c("ecl", "eyr", "hgt", "hcl", "iyr", "byr", "pid")

In other words, I would like to obtain the rows that do contain each of these fields.

Attempt

The stringr package and str_detect function seemed to be the solution! So, I tested it on a single case:

> data$col[1]
[1] "byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d"
> str_detect(data$col[1], fields)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> all(str_detect(data$col[1], fields))
[1] FALSE

This works! If any of the fields is missing from the string, all() returns FALSE.

However, when trying to filter the rows using this option:

data %>% 
    filter( all(str_detect(col, fields)) )

I end up with an empty data frame, and a warning:

Warning message: In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : longer object length is not a multiple of shorter object length

Question(s)

What is causing this warning?
How do you filter a column of strings on the occurrence of multiple patterns in R?

Ronak Shah · Accepted Answer · 2020-12-22 10:05:58Z


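You get the warning because str_detect is vectorised over both of its arguments: the 1st value in col is matched against the 1st value of fields, the 2nd against the 2nd, and so on, with the shorter vector recycled. col has length 5 and fields has length 7, and 7 is not a multiple of 5, which is what the warning is saying. all() then collapses that result to a single FALSE, which filter() applies to every row, so you end up with an empty data frame.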
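To make the pairing concrete, here is a rough base R sketch that mimics the recycling by hand. The names strings, idx and paired are only illustrative, and grepl with fixed = TRUE stands in for str_detect, so treat it as an approximation of what happens rather than stringr's actual implementation.

```r
# Illustrative data: the five strings and seven fields from the question.
strings <- c(
  "byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d",
  "eyr:2023 hcl:#cfa07d ecl:blu hgt:169cm pid:494407412 byr:1936",
  "ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956",
  "hcl:#18171d ecl:oth pid:266824158 hgt:168cm byr:1992 eyr:2021",
  "byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm"
)
fields <- c("ecl", "eyr", "hgt", "hcl", "iyr", "byr", "pid")

# Recycle the shorter vector to the length of the longer one (7),
# so the string indices become 1 2 3 4 5 1 2.
idx <- rep_len(seq_along(strings), length(fields))

# Compare pair by pair: fields[i] against strings[idx[i]].
paired <- mapply(grepl, pattern = fields, x = strings[idx],
                 MoreArgs = list(fixed = TRUE))

print(paired)       # one result per field, not one per row
print(all(paired))  # a single value for the whole column
```

Only 7 of the 35 possible string/field combinations are ever compared, and the result has one element per field instead of one per row, so it cannot serve as a row filter.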

To keep the rows of data in which every value of fields is present, you could do this in base R:

data[Reduce(`&`, lapply(fields, grepl, data$col)), ]

#  col                                                                         
#  <chr>                                                                       
#1 ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956
#2 byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm
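The same check can also be laid out as a logical matrix with one row per string and one column per field, which additionally shows which field each rejected row lacks. This is a minimal sketch in base R, assuming the fields should be matched as literal substrings (hence fixed = TRUE); the names strings, hits, keep and missing_fields are illustrative.

```r
# Illustrative data: the five strings and seven fields from the question.
strings <- c(
  "byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d",
  "eyr:2023 hcl:#cfa07d ecl:blu hgt:169cm pid:494407412 byr:1936",
  "ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956",
  "hcl:#18171d ecl:oth pid:266824158 hgt:168cm byr:1992 eyr:2021",
  "byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm"
)
fields <- c("ecl", "eyr", "hgt", "hcl", "iyr", "byr", "pid")

# One grepl() call per field; sapply() binds the results into a
# 5 x 7 logical matrix whose columns are named after the fields.
hits <- sapply(fields, grepl, x = strings, fixed = TRUE)

# A row qualifies only if every field was found in it.
keep <- rowSums(hits) == length(fields)

# For diagnostics: the fields that each row is missing.
missing_fields <- lapply(seq_len(nrow(hits)), function(i) fields[!hits[i, ]])

print(strings[keep])
print(missing_fields)
```

Keeping the full matrix around costs a little memory, but it makes it easy to see why a row was dropped.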

If you are interested in a tidyverse answer, you could write the above as:

library(tidyverse)

data %>% filter(map(fields, ~str_detect(data$col, .x)) %>% reduce(`&`))
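Another way to look at it is to keep the single-string test from the question and apply it once per row. A possible sketch with purrr's map_lgl follows, assuming the dplyr, purrr and stringr packages are installed; the name result is illustrative, and the patterns are treated as regular expressions, which is harmless here because they are plain letters.

```r
library(dplyr)
library(purrr)
library(stringr)

# Illustrative data: the five strings and seven fields from the question.
data <- tibble(col = c(
  "byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d",
  "eyr:2023 hcl:#cfa07d ecl:blu hgt:169cm pid:494407412 byr:1936",
  "ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956",
  "hcl:#18171d ecl:oth pid:266824158 hgt:168cm byr:1992 eyr:2021",
  "byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm"
))
fields <- c("ecl", "eyr", "hgt", "hcl", "iyr", "byr", "pid")

# For each string (.x), test all seven fields and reduce them to one
# TRUE/FALSE, so filter() receives exactly one logical value per row.
result <- data %>%
  filter(map_lgl(col, ~ all(str_detect(.x, fields))))

print(result)
```

Here str_detect always sees one string and seven patterns, so the lengths are compatible and all() is evaluated separately for every row.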

edited Dec 22, 2020 at 10:05

answered Dec 22, 2020 at 10:00

Ronak Shah
