EDITED FOR ADDED COMPLEXITY IN REAL DATA, SEE BELOW. I am also accepting an answer based on how I originally asked the question. I looked at a few of the other available Q&As, but none of them seemed to work for me, or they lacked enough detail for me to understand how to implement the solutions. I'm not very accustomed to working with regex, so I'm having a hard time coming up with a pattern. I have multiple text strings, some of which may be combined within a single entry in the data, and the combined strings can appear in either order. It's a long data set, and I will have to repeat a similar process for multiple new columns in the data frame, with their contents based partly on the TRUE/FALSE result of the str_detect function, so efficiency is rather important.
Also important to note, as I have seen it mentioned in other answers, I know nothing about Python/Perl. I'm working in RStudio.
First, a similar and simpler data frame to my data.
SN <- c(1001, 1002, 1003, 1004)
fwd_fender <- c(1, 0, 1, 1)
note <- c("FWD FNDR DMG",
"MID BODY CHASS DMG",
"FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
"MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
df <- data.frame(SN, fwd_fender, note)
which produced this data frame:
SN fwd_fender note
1 1001 1 FWD FNDR DMG
2 1002 0 MID BODY CHASS DMG
3 1003 1 FWD FNDR EXCESS WEAR, MID BODY CHASS DMG
4 1004 1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH
What I basically need to do is create another column, which we'll call fwd_fender_mech_dmg, where I can recode the observations based on the note column. In this example, both "FWD FNDR DMG" and "FWD FNDR EXCESS WEAR" count as fwd_fender_mech_dmg. The other notes do not. So I want to end up with a data frame that looks like the following:
SN fwd_fender note fwd_fender_mech_dmg
1 1001 1 FWD FNDR DMG 1
2 1002 0 MID BODY CHASS DMG 0
3 1003 1 FWD FNDR EXCESS WEAR, MID BODY CHASS DMG 1
4 1004 1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH 0
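As a minimal sketch of the recoding step itself, assuming the stringr package is installed: the two qualifying phrases can be spelled out literally and the TRUE/FALSE from str_detect converted to 1/0. The name literal_pattern is only illustrative, and this literal alternation does not generalize to spelling variants or other columns.

```r
library(stringr)

# Rebuild the example data frame from above
df <- data.frame(
  SN = c(1001, 1002, 1003, 1004),
  fwd_fender = c(1, 0, 1, 1),
  note = c("FWD FNDR DMG",
           "MID BODY CHASS DMG",
           "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
           "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH"),
  stringsAsFactors = FALSE
)

# Literal alternation of the two qualifying phrases (illustrative name,
# not a general pattern)
literal_pattern <- "FWD FNDR DMG|FWD FNDR EXCESS WEAR"

# str_detect() returns TRUE/FALSE; as.integer() turns that into 1/0
df$fwd_fender_mech_dmg <- as.integer(str_detect(df$note, literal_pattern))
df$fwd_fender_mech_dmg
# [1] 1 0 1 0
```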
I have lots of columns with lots of different variables, so I'm trying to use a regex as much as possible (ideally) in order to make the coding more efficient, but I'm not getting it to work quite right.
So here is a basic test sequence and pattern.
library(stringr) # provides str_detect()
yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s, pattern, negate = FALSE)
which results in the following:
[1] TRUE TRUE FALSE FALSE TRUE FALSE
These are the expected results. But notice that if I reverse the order of the two notes in the last entry of maybe, the code produces an incorrect result.
yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG") #Last Entry Reversed
s <- c(yes, no, maybe)
pattern <- "FE?ND.*(WEAR|DMG)"
str_detect(s,pattern,negate = FALSE)
which produces this result:
[1] TRUE TRUE FALSE FALSE TRUE TRUE
So, any ideas on how I can make this work?
Thanks!
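For reference, the false positive comes from `.*`, which is free to run past the comma and pick up the DMG that belongs to the neighbouring note. One possible fix, sketched here under the assumption that commas only ever separate notes and never appear inside one, is to replace `.*` with `[^,]*`. The name pattern_one_note is illustrative.

```r
library(stringr)

yes <- c("FWD FNDR DMG", "FWD FENDER EXCESS WEAR")
no <- c("MID BODY CHASS DMG", "FWD FNDR PAINT SCRATCH")
# Both orderings of the problematic combined entry are included
maybe <- c("FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
           "FWD FNDR PAINT SCRATCH, MID BODY CHASS DMG",
           "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
s <- c(yes, no, maybe)

# [^,]* matches any run of characters except a comma, so the match
# cannot cross from the fender note into the neighbouring note
pattern_one_note <- "FE?ND[^,]*(WEAR|DMG)"

result <- str_detect(s, pattern_one_note)
result
# [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
```

If notes can be separated by something other than a comma, the character class would need to exclude that separator as well.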
EDIT
My real data is more complex than the original simplified version, so here is a data frame that even more closely mirrors my data.
SN <- c(1001, 1002, 1003, 1004)
fwd_fender <- c(1, 0, 1, 1)
fwd_fender_note <- c("FWD FNDR DMG",
"MID BODY CHASS DMG",
"FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
"MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH")
rear_axel <- c(0, 1, 1, 1)
rear_axel_note <- c("#", "CORROSION", "CORR, CRACK", "CRACK, WEAR")
computer <- c(0, 1, 0, 0)
computer_note <- c("PROGRAM BUG", "ELEC FAULT", "#", "#")
mid_body_chass <- c(1, 1, 0, 1)
mid_body_chass_note <- c("MID BODY CHASS DMG", "WEAR", "#", "CORR")
df <- data.frame(SN, fwd_fender, fwd_fender_note, rear_axel, rear_axel_note, computer, computer_note, mid_body_chass, mid_body_chass_note)
which produces this data frame:
SN fwd_fender fwd_fender_note rear_axel rear_axel_note computer computer_note mid_body_chass
1 1001 1 FWD FNDR DMG 0 # 0 PROGRAM BUG 1
2 1002 0 MID BODY CHASS DMG 1 CORROSION 1 ELEC FAULT 1
3 1003 1 FWD FNDR EXCESS WEAR, MID BODY CHASS DMG 1 CORR, CRACK 0 # 0
4 1004 1 MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH 1 CRACK, WEAR 0 # 1
mid_body_chass_note
1 MID BODY CHASS DMG
2 WEAR
3 #
4 CORR
Wear and cracks both count as "mechanical damage", so even though they are separate observations, they should only be counted as 1 in the fwd_fender_mech_dmg column (only identifying presence or absence of a given type of observation, not how many there are). Some observations are not specifically tracked in this analysis, so "program bugs" should just be a 0.
So what I am ultimately trying to get to, would be a data frame that looks a bit like the following:
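fwd_fender_mech_dmg <- c(1, 0, 1, 0) # values as shown in the df2 printout
rear_axel_mech_dmg <- c(0, 0, 1, 1) # values as shown in the df2 printout
rear_axel_corrosion <- c(0, 1, 1, 0)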
computer_electrical <- c(0, 1, 0, 0)
mid_body_chass_mech_dmg <- c(1, 1, 1, 1)
df2 <- data.frame(SN, fwd_fender_mech_dmg, rear_axel_mech_dmg, rear_axel_corrosion, computer_electrical, mid_body_chass_mech_dmg)
df2
SN fwd_fender_mech_dmg rear_axel_mech_dmg rear_axel_corrosion computer_electrical mid_body_chass_mech_dmg
1 1001 1 0 0 0 1
2 1002 0 0 1 1 1
3 1003 1 1 1 0 1
4 1004 0 1 0 0 1
Also notice that notes sometimes get entered by operators into the wrong slots (for example, "MID BODY CHASS DMG" under fwd_fender_note), which means that there actually should be a 1 in that row for mid body chassis damage.
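As a rough sketch of how the target columns might be derived, assuming stringr is installed and that the keyword lists below (WEAR, DMG, CRACK for mechanical damage; CORR for corrosion; ELEC for electrical) cover the real data: each note column is tested with its own pattern, and all note columns are pasted together per row so that a misplaced "MID BODY CHASS" note is still found. The names flag, mech, note_cols and all_notes are hypothetical helpers, not part of any package.

```r
library(stringr)

df <- data.frame(
  SN = c(1001, 1002, 1003, 1004),
  fwd_fender_note = c("FWD FNDR DMG",
                      "MID BODY CHASS DMG",
                      "FWD FNDR EXCESS WEAR, MID BODY CHASS DMG",
                      "MID BODY CHASS DMG, FWD FNDR PAINT SCRATCH"),
  rear_axel_note = c("#", "CORROSION", "CORR, CRACK", "CRACK, WEAR"),
  computer_note = c("PROGRAM BUG", "ELEC FAULT", "#", "#"),
  mid_body_chass_note = c("MID BODY CHASS DMG", "WEAR", "#", "CORR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 if the pattern is present, 0 otherwise
flag <- function(x, pattern) as.integer(str_detect(x, pattern))

# Assumed keywords for "mechanical damage"
mech <- "(WEAR|DMG|CRACK)"

# One string per row holding every note, so notes entered in the
# wrong slot can still be matched by their component prefix
note_cols <- c("fwd_fender_note", "rear_axel_note",
               "computer_note", "mid_body_chass_note")
all_notes <- do.call(paste, c(df[note_cols], sep = ", "))

df2 <- data.frame(
  SN = df$SN,
  fwd_fender_mech_dmg = flag(df$fwd_fender_note, paste0("FE?ND[^,]*", mech)),
  rear_axel_mech_dmg = flag(df$rear_axel_note, mech),
  rear_axel_corrosion = flag(df$rear_axel_note, "CORR"),
  computer_electrical = flag(df$computer_note, "ELEC"),
  # Prefixed note anywhere in the row, or an unprefixed keyword in its own slot
  mid_body_chass_mech_dmg = as.integer(
    str_detect(all_notes, paste0("MID BODY CHASS[^,]*", mech)) |
      str_detect(df$mid_body_chass_note, mech)
  )
)
df2
```

One limitation of this sketch: an unprefixed keyword is attributed to whichever column it sits in, so a misplaced note for another component would still be counted for the wrong one.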
Whew, I know that is a lot, hopefully I didn't make it too complex. Thanks!