0

I have a large data set grouped by two identifiers (Group and ID), with an Initial column listing the elements that occur in an initial time period and a Post column listing the elements that occur after that period. A working example is below:

SampleDF<-data.frame(Group=c(0,0,1),ID=c(2,2,3),
Initial=c('F28D,G06F','F24J','G01N'), 
Post=c('G06F','H02G','F23C,H02G,G01N'))

I want to compare elements in Initial and Post for each Group/ID combination to find out when elements match, when only new elements exist, and when both pre-existing and new elements exist. Ideally, I would like to end up with a new Type variable with the following output:

SampleDF<-cbind(SampleDF, 'Type'=rbind(0,1,2))

where (relative to Initial) 0 indicates that there are no new element(s) in Post, 1 indicates that there are only new element(s) in Post, and 2 indicates that there are both pre-existing and new element(s) in Post.

1
  • Input is missing a `'`. Commented Oct 24, 2017 at 16:16

2 Answers

1

Your situation is complex because both the pattern and the vector being searched vary from row to row when matching strings with agrepl. The solution below is somewhat tricky, but it produces the desired output for your sample data.

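element_counter <- list()
for (i in seq_along(SampleDF$Initial)) {
  if (length(strsplit(as.character(SampleDF$Initial[i]), ",")[[1]]) > 1) {
    # Initial has several elements: use Post as a single pattern and subtract
    # the number of Initial elements it approximately matches from 1
    element_counter[[i]] <- length(as.character(SampleDF$Post[i])) - sum(agrepl(as.character(SampleDF$Post[i]),strsplit(as.character(SampleDF$Initial[i]), ",")[[1]]))
  }   else { 
    # Initial has one element: count the Post elements that do not approximately match it
    element_counter[[i]] <- length(strsplit(as.character(SampleDF$Post[i]), ",")[[1]]) - sum(agrepl(SampleDF$Initial[i], strsplit(as.character(SampleDF$Post[i]), ",")[[1]]))
  }
}

# Flatten the per-row counts into the new Type column
SampleDF$Type <- unlist(element_counter) 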


## SampleDF
#   Group  ID   Initial             Post  Type
#1     0   2  F28D,G06F             G06F    0
#2     0   2       F24J             H02G    1
#3     1   3       G01N   F23C,H02G,G01N    2
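
Note that the loop above counts unmatched elements, so it matches the requested 0/1/2 coding for this sample but may not for other inputs, and agrepl matches approximately rather than exactly. As an alternative, here is a minimal sketch of an exact, set-based classification. It assumes the elements are comma-separated and that surrounding whitespace should be ignored; classify_type is a hypothetical helper name, not something from the question.

```r
# Hypothetical helper (name is an assumption): classify Post relative to
# Initial with exact set operations instead of approximate matching.
classify_type <- function(initial, post) {
  # Split on commas and drop surrounding whitespace
  split_trim <- function(x) trimws(strsplit(as.character(x), ",", fixed = TRUE)[[1]])
  ini <- split_trim(initial)
  pst <- split_trim(post)
  has_new <- length(setdiff(pst, ini)) > 0    # elements in Post but not in Initial
  has_old <- length(intersect(pst, ini)) > 0  # elements present in both
  # 0 = nothing new, 1 = only new, 2 = new and pre-existing
  if (!has_new) 0L else if (has_old) 2L else 1L
}

SampleDF <- data.frame(Group = c(0, 0, 1), ID = c(2, 2, 3),
                       Initial = c("F28D,G06F", "F24J", "G01N"),
                       Post = c("G06F", "H02G", "F23C,H02G,G01N"),
                       stringsAsFactors = FALSE)

SampleDF$Type <- mapply(classify_type, SampleDF$Initial, SampleDF$Post,
                        USE.NAMES = FALSE)
SampleDF$Type
# [1] 0 1 2
```

Because the comparison uses setdiff and intersect, a code only counts as pre-existing when it is identical to one in Initial.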

1

I split the process into two steps: finding rows with any new values, and then finding rows with only new values. Adding those two logical vectors together creates the types. The only caveat is that the type definitions are a little different than the definitions in your question: 0 indicates no new measures, 1 indicates that there are both new and pre-existing measures, and 2 indicates that there are only new measures.

# This approach needs character columns, not factors, so stringsAsFactors = FALSE
SampleDF<-data.frame(Group=c(0,0,1),ID=c(2,2,3),
                     Initial=c('F28D,G06F','F24J','G01N'), 
                     Post=c('G06F','H02G','F23C,H02G,G01N'),
                     stringsAsFactors = FALSE)

# Identify rows where there are new occurrences in Post that are not present in Initial
SampleDF$anyNewOccurrences <- 
  mapply(FUN = function(pattern, x){
    any(!grepl(pattern, x))}, 
    pattern = gsub("," , "|", SampleDF$Initial), 
    x = strsplit(SampleDF$Post, ","))

# Identify rows where there are only new occurrences (no repeated values from Initial)
SampleDF$onlyNewOccurrences <- 
  mapply(FUN = function(pattern, x){
    all(!grepl(pattern, x))}, 
    pattern = gsub("," , "|", SampleDF$Initial), 
    x = strsplit(SampleDF$Post, ","))

# Add the two values together to create a type code
SampleDF$Type <- SampleDF$onlyNewOccurrences + SampleDF$anyNewOccurrences
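
If the coding from the question is needed instead (1 = only new, 2 = both new and pre-existing), the same two checks can be recoded rather than summed. A minimal sketch, assuming the same sample data; the names is_new, any_new, and only_new and the ifelse recoding are illustrative and not part of the approach above:

```r
SampleDF <- data.frame(Group = c(0, 0, 1), ID = c(2, 2, 3),
                       Initial = c("F28D,G06F", "F24J", "G01N"),
                       Post = c("G06F", "H02G", "F23C,H02G,G01N"),
                       stringsAsFactors = FALSE)

# For each row, flag the Post elements that do not match any Initial element
is_new <- mapply(function(pattern, x) !grepl(pattern, x),
                 pattern = gsub(",", "|", SampleDF$Initial),
                 x = strsplit(SampleDF$Post, ","),
                 SIMPLIFY = FALSE, USE.NAMES = FALSE)

any_new  <- vapply(is_new, any, logical(1))  # at least one new element
only_new <- vapply(is_new, all, logical(1))  # every element is new

# Question's coding: 0 = nothing new, 1 = only new, 2 = new and pre-existing
SampleDF$Type <- ifelse(!any_new, 0L, ifelse(only_new, 1L, 2L))
SampleDF$Type
# [1] 0 1 2
```

Keep in mind that grepl matches substrings, so an Initial element such as G06 would also match G06F in Post.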
