EDITED FOR ADDED COMPLEXITY IN REAL DATA, SEE BELOW. I am also accepting an answer based on how I originally asked the question. I looked at several related Q&As, but none of the solutions worked for me, or they lacked enough detail for me to understand how to implement them. I'm not very experienced with regex, so I'm having a hard time coming up with a pattern. I have multiple text strings, some of which can be combined within a single entry, and the combined strings can appear in either order. It's a long data set, and I will have to repeat a similar process for multiple new columns in the data frame, with contents based partly on the TRUE/FALSE result of the str_detect function, so efficiency is rather important.

Also important to note, as I have seen it mentioned in other answers, I know nothing about Python/Perl. I'm working in RStudio.

First, a similar and simpler data frame to my data.

SN <- c(1001, 1002, 1003, 1004)
fwd_fender <- c(1, 0, 1, 1)
note <- c("FWD FNDR DMG", 
          "MID BODY CHASS DMG", 
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", 
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
df <- data.frame(SN, fwd_fender, note)

which produced this data frame:

SN fwd_fender                                       note
1 1001          1                               FWD FNDR DMG
2 1002          0                         MID BODY CHASS DMG
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH

What I am basically needing to do is to create another column that we'll call fwd_fender_mech_dmg where I can recode the observations based on the note column. In this example, both "FWD FNDR DMG" and "FWD FNDR EXCESS WEAR" count as "fwd_fender_mechanical_dmg". The other notes do not. So I want to end up producing a data frame that looks like the following:

SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0

I have lots of columns with lots of different variables, so I'm trying to use a regex as much as possible (ideally) in order to make the coding more efficient, but I'm not getting it to work quite right.

So here is a basic test sequence and pattern.

library(stringr)

yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH") 
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s,pattern,negate = FALSE)

which results in the following:

[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

This is the expected result. But notice that if I reverse the order of the two notes inside the last entry of maybe, the code produces an incorrect result.

yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG") #Last Entry Reversed
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s,pattern,negate = FALSE)

which produces this result:

[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

So, any ideas on how I can make this work?

Thanks!

EDIT

My real data is more complex than the original simplified version, so here is a data frame that even more closely mirrors my data.

SN <- c(1001, 1002, 1003, 1004)
fwd_fender <- c(1, 0, 1, 1)
fwd_fender_note <- c("FWD FNDR DMG", 
          "MID BODY CHASS DMG", 
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", 
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
rear_axel <- c(0, 1, 1, 1)
rear_axel_note <- c("#", "CORROSION", "CORR, CRACK", "CRACK, WEAR")
computer <- c(0, 1, 0, 0)
computer_note <- c("PROGRAM BUG", "ELEC FAULT", "#", "#")
mid_body_chass <- c(1, 1, 0, 1)
mid_body_chass_note <- c("MID BODY CHASS DMG", "WEAR", "#", "CORR")
df <- data.frame(SN, fwd_fender, fwd_fender_note, rear_axel, rear_axel_note, computer, computer_note, mid_body_chass, mid_body_chass_note)

which produces this data frame:

 SN fwd_fender                            fwd_fender_note rear_axel rear_axel_note computer computer_note mid_body_chass
1 1001          1                               FWD FNDR DMG         0              #        0   PROGRAM BUG              1
2 1002          0                         MID BODY CHASS DMG         1      CORROSION        1    ELEC FAULT              1
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG         1    CORR, CRACK        0             #              0
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH         1    CRACK, WEAR        0             #              1
  mid_body_chass_note
1  MID BODY CHASS DMG
2                WEAR
3                   #
4                CORR

Wear and cracks both count as "mechanical damage", so even though they are separate observations, they should only be counted as 1 in the fwd_fender_mech_dmg column (only identifying presence or absence of a given type of observation, not how many there are). Some observations are not specifically tracked in this analysis, so "program bugs" should just be a 0.

So what I am ultimately trying to get to, would be a data frame that looks a bit like the following:

fwd_fender_mech_dmg <- c(1, 0, 1, 0)
rear_axel_mech_dmg <- c(0, 0, 1, 1)
rear_axel_corrosion <- c(0, 1, 1, 0)
computer_electrical <- c(0, 1, 0, 0)
mid_body_chass_mech_dmg <- c(1, 1, 1, 1)
df2 <- data.frame(SN, fwd_fender_mech_dmg, rear_axel_mech_dmg, rear_axel_corrosion, computer_electrical, mid_body_chass_mech_dmg)

df2
 SN fwd_fender_mech_dmg rear_axel_mech_dmg rear_axel_corrosion computer_electrical mid_body_chass_mech_dmg
1 1001                   1                  0                   0                   0                       1
2 1002                   0                  0                   1                   1                       1
3 1003                   1                  1                   1                   0                       1
4 1004                   0                  1                   0                   0                       1

Also notice that operators sometimes enter notes into the wrong slots (for example, "MID BODY CHASS DMG" under "fwd_fender_note"), which means that there actually should be a 1 in that row for mid body chassis damage.

Whew, I know that is a lot, hopefully I didn't make it too complex. Thanks!

2 Answers

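The problem with your code is that you are treating 'xxx, yyy' as two character elements (as in a character vector), but it is actually a single string that happens to contain a comma. Because `.*` matches any characters, including the comma, the pattern can start in one note and finish in the next.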
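To see this concretely, one can pull out the text that the pattern actually matched. The following is a minimal sketch using base R's regexpr and regmatches instead of stringr, so that it runs without extra packages; the variable names x and matched are only illustrative.

```r
# The reversed entry from the question: one string, not two elements
x <- "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG"
pattern <- "FE?ND.*(WEAR|DMG)"

# Extract the substring that the pattern matched
matched <- regmatches(x, regexpr(pattern, x))
print(matched)
```

The match begins at FNDR in the first note and ends at DMG in the second, which is why the whole entry is reported as TRUE.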

If we want your current regex to work, we can first str_split the strings by the comma, then call str_detect on all substrings, and, finally, reduce the output to a single logical per row.

library(stringr)
library(purrr)
library(dplyr)

# Split each note on commas, test every piece, then collapse to one 0/1 per row
df %>% mutate(fwd_fender_mech_dmg = as.integer(str_split(note, ',') %>%
                      map(~str_detect(.x, "FE?ND.*(WEAR|DMG)") %>%
                      reduce(`|`))))
    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0

This gives consistent results for any order of the comma-separated substrings:

df2<-df%>%mutate(note=replace(note, 4, maybe[2]))

df2 %>% mutate(fwd_fender_mech_dmg = as.integer(str_split(note, ',') %>%
                      map(~str_detect(.x, "FE?ND.*(WEAR|DMG)")%>%
                      reduce(`|`))))

    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG                   0
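If a comma always separates the individual notes, a lighter alternative may be to stop the pattern from crossing a comma by replacing `.*` with `[^,]*`. This is a sketch under that assumption, not something tested on the real data. It uses base R's grepl so it runs without packages; the same pattern string should also be usable with str_detect. The name pattern_no_comma is illustrative.

```r
# Test strings from the question, in both orders
s <- c("FWD FNDR DMG",
       "FWD FENDER EXCESS WEAR",
       "MID BODY CHASS DMG",
       "FWD FNDR PAINT SCRATCH",
       "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
       "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG",
       "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")

# [^,]* matches any run of characters except a comma,
# so FNDR and WEAR/DMG must sit in the same comma-separated note
pattern_no_comma <- "FE?ND[^,]*(WEAR|DMG)"

result <- grepl(pattern_no_comma, s)
print(result)
```

With this pattern both orderings of the PAINT SCRATCH entry come out FALSE. It avoids the split/map/reduce step, at the cost of relying on the comma as the only separator.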

Some advice

It seems that your data is not "tidy". The "note" variable occasionally has two data values collapsed as a single char element.

It may make your life much easier in downstream analyses if you separate the data so that there will always be one observation per value.

To do that, you could use something like this:

library(dplyr)
library(stringr)
library(tidyr)

df %>% tidyr::separate_rows(note, sep='\\s*,\\s*') %>% #this separates the rows
        mutate(fwd_fender_mech_dmg = +str_detect(note, "FE?ND.*(WEAR|DMG)"))

# A tibble: 6 x 4
     SN fwd_fender note                   fwd_fender_mech_dmg
  <dbl>      <dbl> <chr>                                <int>
1  1001          1 FWD FNDR DMG                             1
2  1002          0 MID BODY CHASS DMG                       0
3  1003          1 FWD FNDR EXCESS WEAR                     1
4  1003          1 MID BODY CHASS DMG                       0
5  1004          1 MID BODY CHASS DMG                       0
6  1004          1 FWD FNDR PAINT SCRATCH                   0
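If one row per serial number is needed again afterwards, the long table can be collapsed by taking the maximum flag per SN. The sketch below shows the idea in base R (strsplit and aggregate stand in for separate_rows and group_by/summarise, so no packages are required); the names pieces, long and per_sn are illustrative.

```r
SN <- c(1001, 1002, 1003, 1004)
note <- c("FWD FNDR DMG",
          "MID BODY CHASS DMG",
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")

# Long form: one row per comma-separated note
pieces <- strsplit(note, " *, *")
long <- data.frame(SN = rep(SN, lengths(pieces)), note = unlist(pieces))
long$fwd_fender_mech_dmg <- as.integer(grepl("FE?ND.*(WEAR|DMG)", long$note))

# Back to one row per SN: 1 if any note for that SN matched, else 0
per_sn <- aggregate(fwd_fender_mech_dmg ~ SN, data = long, FUN = max)
print(per_sn)
```

Taking the maximum records only presence or absence, so several matching notes for the same SN still count as a single 1.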

9 Comments

I'm thinking your approach will work best, but to expand on it a bit: in my real data I don't have just one note column. I have 27 variables, each with its own notes to be interpreted, so there are 27 columns where "note" is a suffix. With your method, in the simplified data in my example, I can see how I could possibly use group_by and summary functions to help collapse the data back down to one row per serial number. Could I feasibly still make this solution work with the much larger data set, given all the different "note" columns?
Hard to tell without a proper reproducible example. You can either accept/vote on the current answers and open a new question, or edit the question and we'll see. I suspect we can do what you ask with dplyr and purrr.
If there are several "note" columns, you can use something like df2 %>% mutate(across(starts_with('note'), ~as.integer(str_split(., ',') %>% map(~str_detect(.x, "FE?ND.*(WEAR|DMG)") %>% reduce(`|`)))))
thanks for the replies. I edited my post to also include a more complex sample data set. I still think your solution has merit. Because I have so many differences between my columns, might it be best to split the larger data frame into multiple data frames and then join/bind them back together? That seems like a lot of extra coding as there are nearly 30 different observation and note column pairs, but I really only need to write it once (this is an annual analysis). I'm still fairly new to R, so I know there is a bunch I still don't know.
Thanks, I see what you mean. I'll keep that in mind, maybe I should go ahead and ask a separate question with more the wrangling focus.

I'd recommend using the case_when function from the tidyverse:

df %>% 
  mutate(fwd_fender_mech_dmg = case_when(grepl("FWD FNDR DMG", note) ~ 1,
                                         grepl("FWD FNDR EXCESS WEAR", note) ~ 1,
                                         TRUE ~ 0))

Output:

    SN fwd_fender                                       note fwd_fender_mech_dmg
1 1001          1                               FWD FNDR DMG                   1
2 1002          0                         MID BODY CHASS DMG                   0
3 1003          1   FWD FNDR EXCESS WEAR, MID BODY CHASS DMG                   1
4 1004          1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH                   0
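When many flag columns have to be derived this way, one option is to keep the patterns in a named vector and loop over it instead of writing one case_when per column. The following is only a sketch in base R: the flag name mid_body_chass_dmg and both patterns are assumptions chosen for illustration, and `[^,]*` assumes that notes are separated by commas.

```r
note <- c("FWD FNDR DMG",
          "MID BODY CHASS DMG",
          "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
          "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")

# Hypothetical lookup: new column name -> regex that defines it
patterns <- c(
  fwd_fender_mech_dmg = "FE?ND[^,]*(WEAR|DMG)",
  mid_body_chass_dmg  = "MID BODY CHASS[^,]*DMG"
)

# One 0/1 column per pattern; column names come from names(patterns)
flags <- sapply(patterns, function(p) as.integer(grepl(p, note)))

result <- cbind(data.frame(note), flags)
print(result)
```

Adding another flag then only means adding one entry to patterns.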
