Populating variable values in R dataframe based on string matching

Question

I have the following dataframe in R:

df <- data.frame(Sample_name = c("01_00H_NA_DNA",   "01_00H_NA_RNA",    "01_00H_NA_S",  "01_00H_NW_DNA",    "01_00H_NW_RNA",    "01_00H_NW_S",  "01_00H_OM_DNA",    "01_00H_OM_RNA",    "01_00H_OM_S",  "01_00H_RL_DNA",    "01_00H_RL_RNA",    "01_00H_RL_S"),
             Pair = c("","", "S1","","","S2","","","S3","", "","S5"))

I would like to generate a new variable Label such that similar strings in Sample_name until the last _ before DNA/RNA or S get matched to give a similar label Id number. While each row may not start with 01_00H, there will always be similar strings until the last underscore to group for the label variable.

Furthermore, I would like to also fill the pair variable with similar values, S1 for all identical labels and so on. The existing Pair values are not continuous i.e S3 is followed by S5 and so on.

Resulting dataframe will look something like this:

This has been incredibly hard to do, I followed How to create new column in dataframe based on partial string matching other column in R but it helped me only partially for direct 1:1 renaming.

Any solutions from useRs will be much appreciated, Thanks!

I'm not entirely sure if I understand what you wish to be matched - does factor(tmp <- sub("(^.+)_(DNA|RNA|S)$", "\\1", df$Sample_name), labels=seq_along(unique(tmp))) for instance work for your real data? — thelatemail
– thelatemail, Commented Mar 21, 2017 at 1:32
Hi! Thanks, indeed the solution you provided works to create a new variable Label in my df How do I complete the second part of the question, where all identical labels get a corresponding Pair value (i.e the missing rows in Pair get the same Pair ID S1 or S2 or S3 as the one available value for that group in the original df? — Manasi Shah
– Manasi Shah, Commented Mar 21, 2017 at 1:37

r2evans · Accepted Answer · 2017-03-21 13:56:49Z

1

Try this:

df$x <- gsub("_[^_]+$", "", df$Sample_name)
df$Label <- match(df$x, unique(df$x))
df$Pair <- ave(as.character(df$Pair), df$Label, FUN=max)
df$x <- NULL
df
#      Sample_name Pair Label
# 1  01_00H_NA_DNA   S1     1
# 2  01_00H_NA_RNA   S1     1
# 3    01_00H_NA_S   S1     1
# 4  01_00H_NW_DNA   S2     2
# 5  01_00H_NW_RNA   S2     2
# 6    01_00H_NW_S   S2     2
# 7  01_00H_OM_DNA   S3     3
# 8  01_00H_OM_RNA   S3     3
# 9    01_00H_OM_S   S3     3
# 10 01_00H_RL_DNA   S5     4
# 11 01_00H_RL_RNA   S5     4
# 12   01_00H_RL_S   S5     4

Or using dplyr:

library(dplyr)
df %>%
  mutate(
    x = gsub("_[^_]+$", "", Sample_name),
    Label = match(x, unique(x))
  ) %>%
  select(-x) %>%
  group_by(Label) %>%
  mutate(Pair = paste0(Pair, collapse = "")) %>%
  ungroup()
# # A tibble: 12 × 3
#      Sample_name  Pair Label
#           <fctr> <chr> <int>
# 1  01_00H_NA_DNA    S1     1
# 2  01_00H_NA_RNA    S1     1
# 3    01_00H_NA_S    S1     1
# 4  01_00H_NW_DNA    S2     2
# 5  01_00H_NW_RNA    S2     2
# 6    01_00H_NW_S    S2     2
# 7  01_00H_OM_DNA    S3     3
# 8  01_00H_OM_RNA    S3     3
# 9    01_00H_OM_S    S3     3
# 10 01_00H_RL_DNA    S5     4
# 11 01_00H_RL_RNA    S5     4
# 12   01_00H_RL_S    S5     4

Edit: added @thelatemail's use of ave, better by codegolf and stability.

edited Mar 21, 2017 at 13:56

answered Mar 21, 2017 at 1:39

r2evans

167k8 gold badges92 silver badges176 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

thelatemail Over a year ago

ave will be safer than unlist(by(..., which I think relies on the dataset being ordered.

r2evans Over a year ago

Personally, I prefer the dplyr solution. I'm not as fluent with ave ...

thelatemail Over a year ago

Just ave(as.character(df$Pair), df$Label, FUN=max) will do it.

Manasi Shah Over a year ago

Hi, thanks, I upvoted and accepted your answer but have a follow up question. What if I were to generate another variable where from the original variable value 01_00H_NA_DNA I now want to make a new variable based on combining similar 1st and 3rd parts of the above string i.e all the 01_NA values be called fix1 in a new variable called new. How would the gsub expression change for that? Thanks!

r2evans Over a year ago

If you are certain that the strings will always have the same structure of 3+ groups separated by underscore, then you can use mutate(df, name2 = gsub("^([^_]+)_[^_]+_([^_]+)_?.*$", "\\1_\\2", name)). (The last portion, _?.*$, handles a third or more underscore, leading to a fourth or more groups.) An alternative is to use strsplit, but that takes some more work since it returns lists, and is much less direct than this and is therefore a bit slower.

Collectives™ on Stack Overflow

Populating variable values in R dataframe based on string matching

1 Answer 1

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related