2

I have the following dataframe in R:

df <- data.frame(Sample_name = c("01_00H_NA_DNA",   "01_00H_NA_RNA",    "01_00H_NA_S",  "01_00H_NW_DNA",    "01_00H_NW_RNA",    "01_00H_NW_S",  "01_00H_OM_DNA",    "01_00H_OM_RNA",    "01_00H_OM_S",  "01_00H_RL_DNA",    "01_00H_RL_RNA",    "01_00H_RL_S"),
             Pair = c("","", "S1","","","S2","","","S3","", "","S5"))

I would like to generate a new variable Label such that similar strings in Sample_name until the last _ before DNA/RNA or S get matched to give a similar label Id number. While each row may not start with 01_00H, there will always be similar strings until the last underscore to group for the label variable.

Furthermore, I would like to also fill the pair variable with similar values, S1 for all identical labels and so on. The existing Pair values are not continuous i.e S3 is followed by S5 and so on.

Resulting dataframe will look something like this:

The resulting dataframe will look something like this:

This has been incredibly hard to do, I followed How to create new column in dataframe based on partial string matching other column in R but it helped me only partially for direct 1:1 renaming.

Any solutions from useRs will be much appreciated, Thanks!

2
  • 1
    I'm not entirely sure if I understand what you wish to be matched - does factor(tmp <- sub("(^.+)_(DNA|RNA|S)$", "\\1", df$Sample_name), labels=seq_along(unique(tmp))) for instance work for your real data? Commented Mar 21, 2017 at 1:32
  • Hi! Thanks, indeed the solution you provided works to create a new variable Label in my df How do I complete the second part of the question, where all identical labels get a corresponding Pair value (i.e the missing rows in Pair get the same Pair ID S1 or S2 or S3 as the one available value for that group in the original df? Commented Mar 21, 2017 at 1:37

1 Answer 1

1

Try this:

df$x <- gsub("_[^_]+$", "", df$Sample_name)
df$Label <- match(df$x, unique(df$x))
df$Pair <- ave(as.character(df$Pair), df$Label, FUN=max)
df$x <- NULL
df
#      Sample_name Pair Label
# 1  01_00H_NA_DNA   S1     1
# 2  01_00H_NA_RNA   S1     1
# 3    01_00H_NA_S   S1     1
# 4  01_00H_NW_DNA   S2     2
# 5  01_00H_NW_RNA   S2     2
# 6    01_00H_NW_S   S2     2
# 7  01_00H_OM_DNA   S3     3
# 8  01_00H_OM_RNA   S3     3
# 9    01_00H_OM_S   S3     3
# 10 01_00H_RL_DNA   S5     4
# 11 01_00H_RL_RNA   S5     4
# 12   01_00H_RL_S   S5     4

Or using dplyr:

library(dplyr)
df %>%
  mutate(
    x = gsub("_[^_]+$", "", Sample_name),
    Label = match(x, unique(x))
  ) %>%
  select(-x) %>%
  group_by(Label) %>%
  mutate(Pair = paste0(Pair, collapse = "")) %>%
  ungroup()
# # A tibble: 12 × 3
#      Sample_name  Pair Label
#           <fctr> <chr> <int>
# 1  01_00H_NA_DNA    S1     1
# 2  01_00H_NA_RNA    S1     1
# 3    01_00H_NA_S    S1     1
# 4  01_00H_NW_DNA    S2     2
# 5  01_00H_NW_RNA    S2     2
# 6    01_00H_NW_S    S2     2
# 7  01_00H_OM_DNA    S3     3
# 8  01_00H_OM_RNA    S3     3
# 9    01_00H_OM_S    S3     3
# 10 01_00H_RL_DNA    S5     4
# 11 01_00H_RL_RNA    S5     4
# 12   01_00H_RL_S    S5     4

Edit: added @thelatemail's use of ave, better by codegolf and stability.

Sign up to request clarification or add additional context in comments.

5 Comments

ave will be safer than unlist(by(..., which I think relies on the dataset being ordered.
Personally, I prefer the dplyr solution. I'm not as fluent with ave ...
Just ave(as.character(df$Pair), df$Label, FUN=max) will do it.
Hi, thanks, I upvoted and accepted your answer but have a follow up question. What if I were to generate another variable where from the original variable value 01_00H_NA_DNA I now want to make a new variable based on combining similar 1st and 3rd parts of the above string i.e all the 01_NA values be called fix1 in a new variable called new. How would the gsub expression change for that? Thanks!
If you are certain that the strings will always have the same structure of 3+ groups separated by underscore, then you can use mutate(df, name2 = gsub("^([^_]+)_[^_]+_([^_]+)_?.*$", "\\1_\\2", name)). (The last portion, _?.*$, handles a third or more underscore, leading to a fourth or more groups.) An alternative is to use strsplit, but that takes some more work since it returns lists, and is much less direct than this and is therefore a bit slower.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.