Data

I have a data frame with a single column consisting of strings in R.

data <- structure(list(col = c("byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d", 
                       "eyr:2023 hcl:#cfa07d ecl:blu hgt:169cm pid:494407412 byr:1936", 
                       "ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956", 
                       "hcl:#18171d ecl:oth pid:266824158 hgt:168cm byr:1992 eyr:2021", 
                       "byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

Problem

I want to filter this data frame on the rows that contain the following patterns/fields:

fields <- c("ecl", "eyr", "hgt", "hcl", "iyr", "byr", "pid")

In other words, I would like to obtain only the rows that contain every one of these fields.

Attempt

The stringr package and str_detect function seemed to be the solution! So, I tested it on a single case:

> data$col[1]
[1] "byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d"
> str_detect(data$col[1], fields)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> all(str_detect(data$col[1], fields))
[1] FALSE

This works! If any of the fields are not present in the string, it is evaluated as false.

However, when trying to filter the rows using this option:

data %>% 
    filter( all(str_detect(col, fields)) )

I end up with an empty data frame, and a warning:

Warning message:
In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
  longer object length is not a multiple of shorter object length

Question(s)

  • What is causing this warning?
  • How do you filter a column of strings on the occurrence of multiple patterns in R?

Answer

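You get the warning because str_detect is vectorised over both the string and the pattern: the 1st value in col is matched against the 1st value of fields, the 2nd against the 2nd, and so on. col has length 5 and fields has length 7, so the shorter vector is recycled, and because 7 is not a multiple of 5, R issues the warning. On top of that, all() collapses the result to a single FALSE, which filter() applies to every row, hence the empty data frame.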
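To make the element-wise matching concrete, here is a minimal sketch. The vectors strings and patterns are made up for illustration (they are not from the question), and the lengths are kept compatible so the example does not depend on how a particular stringr version handles mismatched lengths.

```r
library(stringr)

# Illustrative inputs (not from the question): two strings, two patterns.
strings  <- c("byr:1985 eyr:2021", "eyr:2023 hcl:#cfa07d")
patterns <- c("byr", "byr")

# Equal lengths: strings[1] is tested against patterns[1],
# strings[2] against patterns[2].
pairwise <- str_detect(strings, patterns)
print(pairwise)      # TRUE FALSE

# One string against several patterns: the single string is recycled,
# giving one result per pattern.
one_vs_many <- str_detect(strings[1], c("byr", "eyr", "hcl"))
print(one_vs_many)   # TRUE TRUE FALSE

# all() reduces any logical vector to a single value, so inside filter()
# it produces one TRUE/FALSE for the whole column, not one per row.
print(all(one_vs_many))  # FALSE
```

This is why the single-string test in the question behaves as expected, while the same expression applied to the whole column does not.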

To keep only the rows of data in which every value of fields is present, you could do the following in base R:

data[Reduce(`&`, lapply(fields, grepl, data$col)), ]

#  col                                                                         
#  <chr>                                                                       
#1 ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956
#2 byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm      

If you are interested in a tidyverse answer, you could write the above as:

library(tidyverse)

data %>% filter(map(fields, ~str_detect(data$col, .x)) %>% reduce(`&`))
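
As an alternative, the check can also be done one row at a time, which mirrors the single-string test from the question. The following is only a sketch in base R: has_all_fields is a hypothetical helper name, the data is rebuilt as a plain data frame, and fixed = TRUE is an assumption that the fields should be matched as literal substrings rather than as regular expressions.

```r
# Same strings and fields as in the question, as a plain data frame.
data <- data.frame(
  col = c(
    "byr:1985 eyr:2021 iyr:2011 hgt:175cm pid:163069444 hcl:#18171d",
    "eyr:2023 hcl:#cfa07d ecl:blu hgt:169cm pid:494407412 byr:1936",
    "ecl:zzz eyr:2036 hgt:109 hcl:#623a2f iyr:1997 byr:2029 cid:169 pid:170290956",
    "hcl:#18171d ecl:oth pid:266824158 hgt:168cm byr:1992 eyr:2021",
    "byr:1932 ecl:hzl pid:284313291 iyr:2017 hcl:#efcc98 eyr:2024 hgt:184cm"
  ),
  stringsAsFactors = FALSE
)
fields <- c("ecl", "eyr", "hgt", "hcl", "iyr", "byr", "pid")

# Hypothetical helper: TRUE when every field occurs in the single string s.
has_all_fields <- function(s, fields) {
  all(vapply(fields, function(f) grepl(f, s, fixed = TRUE), logical(1)))
}

# One TRUE/FALSE per row, then subset.
keep <- vapply(data$col, function(s) has_all_fields(s, fields),
               logical(1), USE.NAMES = FALSE)
result <- data[keep, , drop = FALSE]
print(result)
```

Note that a plain substring match would also accept a field name that happens to appear inside a value; matching on "ecl:" and so on (for example with paste0(fields, ":")) may be safer if that is a concern.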