Stringr regex to work even when string is in a different order

Question

EDITED FOR ADDED COMPLEXITY IN REAL DATA, SEE BELOW. Also accepting an answer based on how I originally asked the question. I tried looking at a few of the other available Q/A's, but none of the ones I looked at seemed to work for me and/or there wasn't enough detail for me to understand how to implement the solutions. I'm not completely accustomed to working with regex, so I'm having a hard time coming up with a pattern. I have multiple text strings, some of which could be combined within the data and these combined strings could be placed in either order. It's a long data set and I will have to repeat a similar process for multiple columns being created within the data frame with the contents based partly on the TRUE/FALSE from the str_detect function, so efficiency is rather important.

Also important to note, as I have seen it mentioned in other answers, I know nothing about Python/Perl. I'm working in RStudio.

First, a similar and simpler data frame to my data.

SN <- c(1001, 1002, 1003, 1004)
fwd_fender <- c(1, 0, 1, 1)
note <- c("FWD FNDR DMG", 
          "MID BODY CHASS DMG", 
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", 
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
df <- data.frame(SN, fwd_fender, note)

which produced this data frame:

SN fwd_fender                                       note
1 1001          1                               FWD FNDR DMG
2 1002          0                         MID BODY CHASS DMG
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH

What I basically need to do is create another column, which we'll call fwd_fender_mech_dmg, where I can recode the observations based on the note column. In this example, both "FWD FNDR DMG" and "FWD FNDR EXCESS WEAR" count as forward fender mechanical damage. The other notes do not. So I want to end up with a data frame that looks like the following:

SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0

I have lots of columns with lots of different variables, so I'm trying to use a regex as much as possible (ideally) in order to make the coding more efficient, but I'm not getting it to work quite right.

So here is a basic test sequence and pattern.

library(stringr)

yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s, pattern, negate = FALSE)

which results in the following:

[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

This is the expected result. But notice that if I switch the order of the two notes in the last entry of maybe, the code produces an incorrect result.

yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG") #Last Entry Reversed
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s,pattern,negate = FALSE)

which produces this result:

[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

So, any ideas on how I can make this work?

Thanks!

EDIT

My real data is more complex than the original simplified version, so here is a data frame that even more closely mirrors my data.

SN <- c(1001, 1002, 1003, 1004)
fwd_fender <- c(1, 0, 1, 1)
fwd_fender_note <- c("FWD FNDR DMG", 
          "MID BODY CHASS DMG", 
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", 
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
rear_axel <- c(0, 1, 1, 1)
rear_axel_note <- c("#", "CORROSION", "CORR, CRACK", "CRACK, WEAR")
computer <- c(0, 1, 0, 0)
computer_note <- c("PROGRAM BUG", "ELEC FAULT", "#", "#")
mid_body_chass <- c(1, 1, 0, 1)
mid_body_chass_note <- c("MID BODY CHASS DMG", "WEAR", "#", "CORR")
df <- data.frame(SN, fwd_fender, fwd_fender_note, rear_axel, rear_axel_note, computer, computer_note, mid_body_chass, mid_body_chass_note)

which produces this data frame:

 SN fwd_fender                            fwd_fender_note rear_axel rear_axel_note computer computer_note mid_body_chass
1 1001          1                               FWD FNDR DMG         0              #        0   PROGRAM BUG              1
2 1002          0                         MID BODY CHASS DMG         1      CORROSION        1    ELEC FAULT              1
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG         1    CORR, CRACK        0             #              0
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH         1    CRACK, WEAR        0             #              1
  mid_body_chass_note
1  MID BODY CHASS DMG
2                WEAR
3                   #
4                CORR

Wear and cracks both count as "mechanical damage", so even though they are separate observations, they should only be counted as 1 in the fwd_fender_mech_dmg column (only identifying presence or absence of a given type of observation, not how many there are). Some observations are not specifically tracked in this analysis, so "program bugs" should just be a 0.

So what I am ultimately trying to get to, would be a data frame that looks a bit like the following:

fwd_fender_mech_dmg <- c(1, 0, 1, 0)
rear_axel_mech_dmg <- c(0, 0, 1, 1)
rear_axel_corrosion <- c(0, 1, 1, 0)
computer_electrical <- c(0, 1, 0, 0)
mid_body_chass_mech_dmg <- c(1, 1, 1, 1)
df2 <- data.frame(SN, fwd_fender_mech_dmg, rear_axel_mech_dmg, rear_axel_corrosion, computer_electrical, mid_body_chass_mech_dmg)

df2
 SN fwd_fender_mech_dmg rear_axel_mech_dmg rear_axel_corrosion computer_electrical mid_body_chass_mech_dmg
1 1001                   1                  0                   0                   0                       1
2 1002                   0                  0                   1                   1                       1
3 1003                   1                  1                   1                   0                       1
4 1004                   0                  1                   0                   0                       1

Also notice that operators sometimes enter notes into the wrong slots (for example "MID BODY CHASS DMG" under "fwd_fender_note"), which means that there should actually be a 1 in that row for mid body chassis damage.

Whew, I know that is a lot, hopefully I didn't make it too complex. Thanks!

GuedesBF · Accepted Answer · 2021-09-23 01:03:46Z

1

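The problem with your code is that you are treating 'xxx, yyy' as two character elements (as in a character vector), but it is actually a single string that happens to contain two notes. The .* in your pattern is free to match across the comma, so a match can start in one note and end in the other.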
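To see this concretely, here is a minimal sketch that uses str_extract on the reversed string from the question to show which part of the string the pattern actually matches. The variable names x and matched are only for illustration.

```r
library(stringr)

pattern <- "FE?ND.*(WEAR|DMG)"
x <- "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG"

# .* is greedy and is not stopped by the comma, so the match starts in
# the fender note and ends at the DMG of the chassis note
matched <- str_extract(x, pattern)
matched
#> [1] "FNDR PAINT SCRATCH, MID BODY CHASS DMG"
```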

If we want your current regex to work, we can first str_split the strings by the comma, then call str_detect on all substrings, and, finally, reduce the output to a single logical per row.

library(stringr)
library(purrr)
library(dplyr)

# as.integer() turns the per-row TRUE/FALSE into the 1/0 shown below
df %>% mutate(fwd_fender_mech_dmg = as.integer(str_split(note, ',') %>%
                      map(~str_detect(.x, "FE?ND.*(WEAR|DMG)") %>%
                      reduce(`|`))))
    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0

This gives the same result for any order of the comma-separated substrings:

df2<-df%>%mutate(note=replace(note, 4, maybe[2]))

df2 %>% mutate(fwd_fender_mech_dmg = as.integer(str_split(note, ',') %>%
                      map(~str_detect(.x, "FE?ND.*(WEAR|DMG)")%>%
                      reduce(`|`))))

    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG                   0
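As an alternative to splitting, and not part of the approach above, the pattern itself can be kept from crossing a comma by replacing .* with [^,]*. This is only a sketch, and it assumes that notes are always separated by commas and never contain a comma themselves. The name pattern_no_comma is made up for the example.

```r
library(stringr)

s <- c("FWD FNDR DMG",
       "FWD FENDER EXCESS WEAR",
       "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
       "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH",
       "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG")

# [^,]* matches any run of characters except a comma, so a match
# cannot run from one note into the next
pattern_no_comma <- "FE?ND[^,]*(WEAR|DMG)"
result <- str_detect(s, pattern_no_comma)
result
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
```

Because this stays a single vectorised str_detect call, it may be the cheaper option on a long data set, but it relies entirely on the comma assumption.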

Some advice

It seems that your data is not "tidy". The "note" variable occasionally has two data values collapsed as a single char element.

It may make your life much easier in downstream analyses if you separate the data so that there will always be one observation per value.

For that, you could do something like this:

library(dplyr)
library(stringr)
library(tidyr)

df %>% tidyr::separate_rows(note, sep='\\s*,\\s*') %>% #this separates the rows
        mutate(fwd_fender_mech_dmg = +str_detect(note, "FE?ND.*(WEAR|DMG)"))

# A tibble: 6 x 4
     SN fwd_fender note                   fwd_fender_mech_dmg
  <dbl>      <dbl> <chr>                                <int>
1  1001          1 FWD FNDR DMG                             1
2  1002          0 MID BODY CHASS DMG                       0
3  1003          1 FWD FNDR EXCESS WEAR                     1
4  1003          1 MID BODY CHASS DMG                       0
5  1004          1 MID BODY CHASS DMG                       0
6  1004          1 FWD FNDR PAINT SCRATCH                   0
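If one row per serial number is still needed afterwards, the separated rows can be collapsed again. The following is a sketch under the assumption that SN identifies a unit uniquely; the names hit, per_sn and result are chosen only for this example.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  SN = c(1001, 1002, 1003, 1004),
  fwd_fender = c(1, 0, 1, 1),
  note = c("FWD FNDR DMG",
           "MID BODY CHASS DMG",
           "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
           "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
)

per_sn <- df %>%
  separate_rows(note, sep = "\\s*,\\s*") %>%              # one note per row
  mutate(hit = str_detect(note, "FE?ND.*(WEAR|DMG)")) %>%
  group_by(SN) %>%
  # 1 if any of the unit's notes matched, otherwise 0
  summarise(fwd_fender_mech_dmg = as.integer(any(hit)), .groups = "drop")

# Attach the flag to the original one-row-per-SN data frame
result <- left_join(df, per_sn, by = "SN")
```

Here per_sn has one row per SN, and the join keeps the original note column untouched next to the new flag.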

edited Sep 23, 2021 at 1:03

answered Sep 23, 2021 at 0:47

GuedesBF


9 Comments

joerminer Over a year ago

I'm thinking your approach will work best, but to expand on it a bit: in my real data I don't have just one note column, I have 27 "note" columns where note is a suffix as I have in the current table 27 variables, each with their own set of notes to be interpreted. With your method, in the simplified data in my example, I can see how I could possibly use group_by and summary functions to help collapse the data back down to one row per serial number. Could I feasibly still make this solution work with the much larger data set given all the different "note" columns?

GuedesBF Over a year ago

Hard to tell without a proper reproducible example. You can either accept/vote on the current answers and open a new question, or edit the question and we will see. I suspect we can do what you ask with dplyr and purrr

GuedesBF Over a year ago

If there are several "note" columns, you can use something like df2 %>% mutate(across(starts_with('note'), ~as.integer(str_split(., ',') %>% map(~str_detect(.x, "FE?ND.*(WEAR|DMG)") %>% reduce(`|`)))))
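A sketch of how that one-liner might look against the edited data in the question, where the note columns carry "_note" as a suffix (so ends_with rather than starts_with). The helper any_piece_matches and the "_fender_dmg" output names are hypothetical and only two of the note columns are included to keep it short.

```r
library(dplyr)
library(purrr)
library(stringr)

df <- data.frame(
  SN = c(1001, 1002, 1003, 1004),
  fwd_fender_note = c("FWD FNDR DMG",
                      "MID BODY CHASS DMG",
                      "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
                      "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH"),
  rear_axel_note = c("#", "CORROSION", "CORR, CRACK", "CRACK, WEAR")
)

# Hypothetical helper: TRUE when any comma-separated piece of a note matches
any_piece_matches <- function(notes, pattern) {
  map_lgl(str_split(notes, "\\s*,\\s*"), ~ any(str_detect(.x, pattern)))
}

# Apply the same pattern to every *_note column, writing new 0/1 columns
flags <- df %>%
  mutate(across(ends_with("_note"),
                ~ as.integer(any_piece_matches(.x, "FE?ND.*(WEAR|DMG)")),
                .names = "{.col}_fender_dmg"))
```

Checking every note column for the same pattern also catches notes that were typed into the wrong slot; a different pattern per category would need one such call per pattern.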

joerminer Over a year ago

Thanks for the replies. I edited my post to also include a more complex sample data set. I still think your solution has merit. Because I have so many differences between my columns, might it be best to split the larger data frame into multiple data frames and then join/bind them back together? That seems like a lot of extra coding as there are nearly 30 different observation and note column pairs, but I really only need to write it once (this is an annual analysis). I'm still fairly new to R, so I know there is a bunch I still don't know.

joerminer Over a year ago

Thanks, I see what you mean. I'll keep that in mind; maybe I should go ahead and ask a separate question with more of a wrangling focus.


frustrated_bioinformatician · Accepted Answer · 2021-09-22 23:50:05Z

0

I'd recommend using the case_when function from the tidyverse:

library(dplyr)

df %>%
  mutate(fwd_fender_mech_dmg = case_when(grepl("FWD FNDR DMG", note) ~ 1,
                                         grepl("FWD FNDR EXCESS WEAR", note) ~ 1,
                                         TRUE ~ 0))

Output:

    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0
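Since both branches return 1, the same result can also be sketched with a single alternation in base R. This is a simplification for illustration, not a replacement for case_when when categories need different values, and like the code above it matches the literal "FWD FNDR" spelling only.

```r
note <- c("FWD FNDR DMG",
          "MID BODY CHASS DMG",
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")

# (DMG|EXCESS WEAR) must follow "FWD FNDR " directly, so a DMG that
# belongs to another note cannot trigger a match
fwd_fender_mech_dmg <- as.integer(grepl("FWD FNDR (DMG|EXCESS WEAR)", note))
fwd_fender_mech_dmg
#> [1] 1 0 1 0
```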

answered Sep 22, 2021 at 23:50

frustrated_bioinformatician

