Partial matching of elements in two string columns in R

Question

I have a large data set grouped by two identifiers (Group and ID), with an Initial column that shows the elements present in an initial time period and a Post column that shows the elements that occur after the initial time period. A working example is below:

SampleDF<-data.frame(Group=c(0,0,1),ID=c(2,2,3),
Initial=c('F28D,G06F','F24J','G01N'),
Post=c('G06F','H02G','F23C,H02G,G01N'))

I want to compare elements in Initial and Post for each Group/ID combination to find out when elements match, when only new elements exist, and when both pre-existing and new elements exist. Ideally, I would like to end up with a new Type variable with the following output:

SampleDF<-cbind(SampleDF, 'Type'=rbind(0,1,2))

where (relative to Initial) 0 indicates that there are no new element(s) in Post, 1 indicates that there are only new element(s) in Post, and 2 indicates that there are both pre-existing and new element(s) in Post.

Input is missing a `'`.

– rsmith54, Commented Oct 24, 2017 at 16:16

Santosh M. · Accepted Answer · 2017-10-24 13:10:24Z

Your situation is complex because both the pattern and the vector vary from row to row when doing string matching with agrepl. Here is a solution that is somewhat tricky but does the job for the sample data.

element_counter = list()
for (i in seq_along(SampleDF$Initial)) {
  if (length(strsplit(as.character(SampleDF$Initial[i]), ",")[[1]]) > 1) {
    # Initial has several elements: use Post as the pattern and match it against each Initial element
    element_counter[[i]] <- length(as.character(SampleDF$Post[i])) - sum(agrepl(as.character(SampleDF$Post[i]),strsplit(as.character(SampleDF$Initial[i]), ",")[[1]]))
  } else {
    # Initial has a single element: use it as the pattern and count the Post elements it does not match
    element_counter[[i]] <- length(strsplit(as.character(SampleDF$Post[i]), ",")[[1]]) - sum(agrepl(SampleDF$Initial[i], strsplit(as.character(SampleDF$Post[i]), ",")[[1]]))
  }
}

SampleDF$Type <- unlist(element_counter) 


## SampleDF
#   Group  ID   Initial             Post  Type
#1     0   2  F28D,G06F             G06F    0
#2     0   2       F24J             H02G    1
#3     1   3       G01N   F23C,H02G,G01N    2
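
The loop above returns a count rather than a category, so it reproduces the 0/1/2 coding for the sample rows but may not for other inputs (for example, a row whose Post holds three new elements). As a minimal sketch, assuming exact rather than approximate matching is acceptable, the three categories from the question can be derived with base R set operations. The helper name classify_post is hypothetical and is not part of either answer.

```r
# Hypothetical helper (not from the answers): classify one row by exact set comparison.
classify_post <- function(initial, post) {
  # Split the comma-separated strings and drop stray whitespace
  init_set <- trimws(strsplit(initial, ",", fixed = TRUE)[[1]])
  post_set <- trimws(strsplit(post, ",", fixed = TRUE)[[1]])
  n_new <- length(setdiff(post_set, init_set))    # Post elements absent from Initial
  n_old <- length(intersect(post_set, init_set))  # Post elements already in Initial
  if (n_new == 0) 0 else if (n_old == 0) 1 else 2
}

SampleDF <- data.frame(Group = c(0, 0, 1), ID = c(2, 2, 3),
                       Initial = c('F28D,G06F', 'F24J', 'G01N'),
                       Post = c('G06F', 'H02G', 'F23C,H02G,G01N'),
                       stringsAsFactors = FALSE)

# Apply the helper row by row; USE.NAMES = FALSE keeps the result an unnamed vector
SampleDF$Type <- mapply(classify_post, SampleDF$Initial, SampleDF$Post,
                        USE.NAMES = FALSE)
```

Because the comparison is exact, near-identical codes such as 'G06F' and 'G06G' are treated as different elements; if partial matches should count, agrepl or grepl would be needed instead.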

jpshanno · Accepted Answer · 2017-10-24 13:40:20Z

I split the process into two steps: finding rows with new values, and then finding rows with only new values. Adding those two logical vectors together creates the types. The only caveat is that the type definitions are a little different than the definitions in your question: 0 indicates no new measures, 1 indicates that there are both new and pre-existing measures, and 2 indicates that there are only new measures.

# This approach needs character columns, not factors, so set stringsAsFactors = FALSE
SampleDF <- data.frame(Group = c(0, 0, 1), ID = c(2, 2, 3),
                       Initial = c('F28D,G06F', 'F24J', 'G01N'),
                       Post = c('G06F', 'H02G', 'F23C,H02G,G01N'),
                       stringsAsFactors = FALSE)

# Identify rows where there are new occurrences in Post that are not present in Initial
SampleDF$anyNewOccurrences <- 
  mapply(FUN = function(pattern, x){
    any(!grepl(pattern, x))}, 
    pattern = gsub("," , "|", SampleDF$Initial), 
    x = strsplit(SampleDF$Post, ","))

# Identify rows where there are only new occurrences (no repeated values from Initial)
SampleDF$onlyNewOccurrences <- 
  mapply(FUN = function(pattern, x){
    all(!grepl(pattern, x))}, 
    pattern = gsub("," , "|", SampleDF$Initial), 
    x = strsplit(SampleDF$Post, ","))

# Add the two values together to create a type code
SampleDF$Type <- SampleDF$onlyNewOccurrences + SampleDF$anyNewOccurrences
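
For the three sample rows, the sum of the two logical vectors is 0 when nothing in Post is new, 1 when Post mixes new and pre-existing elements, and 2 when every Post element is new. If the coding from the question is preferred (1 for only new, 2 for mixed), the codes can be swapped with a lookup vector. This is a small sketch; the names any_new, only_new, lookup, and question_type are illustrative, and the logical values are written out by hand to match what the code above yields for the sample data.

```r
# Logical vectors as produced by the two mapply() calls for the sample rows
any_new  <- c(FALSE, TRUE, TRUE)
only_new <- c(FALSE, TRUE, FALSE)

# Type code used in this answer: 0 = no new, 1 = mixed, 2 = only new
answer_type <- any_new + only_new

# Lookup vector indexed by answer_type + 1: 0 -> 0, 1 -> 2, 2 -> 1
lookup <- c(0, 2, 1)
question_type <- lookup[answer_type + 1]
```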
