I have the following data:

gene_Id <- c( 'No_id' , 'P1_1_EXN' , 'P1_2_EXN' , 
              'P1_1_EXN_O' , 'P1_2_EXN_O' ,
              'P2_1_EXN' , 'P2_2_EXN' , 
              'P2_1_EXN_O' , 'P2_2_EXN_O' , 
             'P1nM1'  , 'P2nM1')

Count_F <- c(rep('KL',5),rep('KD',6))

DF <- data.frame(gene_Id , Count_F)

I would like to create three additional columns. first_one should replace cells that contain the pattern '_Number_' with 'gene_Number' (for example, P1_1_EXN becomes gene_1), with the possibility to control the value assigned to strings that don't match this pattern. second_one should hold the rest of the string after the pattern '_Number_' (EXN in the previous example).

third_one should replace any cell that contains 'P' followed by a number with 'PREP_Number'; for example, P1_1_EXN becomes PREP_1.

EDIT: this is the expected output.

PRER <- c ( 'No_P' ,rep('PREP_1' , 4) , rep('PREP_2' , 4) , 'PREP_1' , 'PREP_2')

Gene_Num <- c ('No_num' , 'gene_1' , 'gene_2' , 'gene_1' , 'gene_2' ,'gene_1',
               'gene_2', 'gene_1', 'gene_2'  , 'NEG' , 'NEG')

Rest <-c('No_rest','EXN','EXN','EXN_O','EXN_O','EXN','EXN','EXN_O','EXN_O', 'Neg','Neg')


New_DF <- cbind(DF,Gene_Num,Rest,PRER)

Thanks a lot in advance.

  • Can you include your expected output please? Commented Jun 24, 2020 at 12:31
  • What about 'P1nM1' and 'P2nM1'? Commented Jun 24, 2020 at 12:47
  • Yes I added an EDIT Commented Jun 24, 2020 at 12:47
  • @MartinGal These could be replaced with anything, because they don't have the pattern '_Number_'. Commented Jun 24, 2020 at 12:49

2 Answers

Here is one possibility using the dplyr package and case_when.

library(dplyr)

DF %>%
  # col1: "gene_<digit>" when the ID contains a digit between two underscores
  mutate(col1 = case_when(grepl("_\\d_", gene_Id) ~ gsub(".*_(\\d)_.*", "gene_\\1", gene_Id),
                          TRUE ~ "dummy1"),
         # col2: everything after the "_<digit>_" part
         col2 = case_when(grepl("_\\d_", gene_Id) ~ gsub("^.*_\\d_", "", gene_Id),
                          TRUE ~ "dummy2"),
         # col3: "PREP_<digit>" when the ID contains "P<digit>"
         col3 = case_when(grepl("P\\d", gene_Id) ~ gsub(".*P(\\d).*", "PREP_\\1", gene_Id),
                          TRUE ~ "dummy3"))

      gene_Id Count_F   col1   col2   col3
1       No_id      KL dummy1 dummy2 dummy3
2    P1_1_EXN      KL gene_1    EXN PREP_1
3    P1_2_EXN      KL gene_2    EXN PREP_1
4  P1_1_EXN_O      KL gene_1  EXN_O PREP_1
5  P1_2_EXN_O      KL gene_2  EXN_O PREP_1
6    P2_1_EXN      KD gene_1    EXN PREP_2
7    P2_2_EXN      KD gene_2    EXN PREP_2
8  P2_1_EXN_O      KD gene_1  EXN_O PREP_2
9  P2_2_EXN_O      KD gene_2  EXN_O PREP_2
10      P1nM1      KD dummy1 dummy2 PREP_1
11      P2nM1      KD dummy1 dummy2 PREP_2

Here is a little explanation: first I check with grepl whether the desired substring is contained in gene_Id. If yes, I extract it according to the rules. If not, I assign a dummy value (I named those dummy1, dummy2 and dummy3).

I use regular expressions to match the strings: \\d matches a digit and _\\d_ matches a digit between two underscores. In gsub, \\1 refers to whatever was matched by the first pair of parentheses: in this case it is always a digit.

So for example the definition of col1 works like this:

  1. Check whether the pattern _\\d_ occurs inside gene_Id: if yes, replace the whole string with gene_\\1, where \\1 is the digit between the underscores.
  2. If you do not find the pattern _\\d_ assign "dummy1".
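
The two steps above can also be sketched in base R, without dplyr. This is only an illustration of the same check-then-replace logic; the vector name ids and the use of ifelse are assumptions made for the example, not part of the answer's code.

```r
# Sample IDs taken from the question's data
ids <- c("No_id", "P1_1_EXN", "P2_2_EXN_O", "P1nM1")

# Step 1: does the ID contain a digit between two underscores?
has_num <- grepl("_\\d_", ids)

# Step 2: where it does, replace the whole string with "gene_" plus the
# captured digit; everywhere else use the fallback value
col1 <- ifelse(has_num, gsub(".*_(\\d)_.*", "gene_\\1", ids), "dummy1")
col1
#> [1] "dummy1" "gene_1" "gene_2" "dummy1"
```

Because gsub returns its input unchanged when nothing matches, the grepl check is what keeps non-matching IDs from slipping through as-is.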

2 Comments

Would it be possible to use different dummy values in col2? For example, the string No_id could get dummy4, P1nM1 dummy5 and P2nM1 dummy6.
Yes, case_when makes this possible: simply add a second logical expression to distinguish between the dummy values. See also ?dplyr::case_when for more info.
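
A minimal sketch of what such an extra condition might look like. The patterns "^P1" and "^P2" and the shortened data frame are assumptions chosen for this example; case_when checks the conditions in order, so the '_Number_' rule still wins for IDs like P1_1_EXN.

```r
library(dplyr)

# Shortened version of the question's data (illustrative subset)
DF <- data.frame(gene_Id = c("No_id", "P1_1_EXN", "P1nM1", "P2nM1"))

res <- DF %>%
  mutate(col2 = case_when(
    grepl("_\\d_", gene_Id) ~ gsub("^.*_\\d_", "", gene_Id),  # rest after "_<digit>_"
    grepl("^P1", gene_Id)   ~ "dummy5",                       # remaining P1... IDs
    grepl("^P2", gene_Id)   ~ "dummy6",                       # remaining P2... IDs
    TRUE                    ~ "dummy4"                        # everything else
  ))

res$col2
#> [1] "dummy4" "EXN"    "dummy5" "dummy6"
```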

An alternative using dplyr and stringr:

library(dplyr)
library(stringr)

DF %>%
  # str_extract returns NA when the pattern is not found
  mutate(Gene     = str_c("gene", str_extract(gene_Id, "_\\d(?=_)")),
         Rest     = str_extract(gene_Id, "(?<=_\\d_).*"),
         P_Number = str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_"))

returns

      gene_Id Count_F   Gene  Rest P_Number
1       No_id      KL   <NA>  <NA>     <NA>
2    P1_1_EXN      KL gene_1   EXN   PREP_1
3    P1_2_EXN      KL gene_2   EXN   PREP_1
4  P1_1_EXN_O      KL gene_1 EXN_O   PREP_1
5  P1_2_EXN_O      KL gene_2 EXN_O   PREP_1
6    P2_1_EXN      KD gene_1   EXN   PREP_2
7    P2_2_EXN      KD gene_2   EXN   PREP_2
8  P2_1_EXN_O      KD gene_1 EXN_O   PREP_2
9  P2_2_EXN_O      KD gene_2 EXN_O   PREP_2
10      P1nM1      KD   <NA>  <NA>   PREP_1
11      P2nM1      KD   <NA>  <NA>   PREP_2

I didn't include any handling for the <NA> cases.
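
One possible way to fill those <NA> cells is dplyr::coalesce, which replaces missing values with a fallback. This is a sketch under assumptions: the fallback labels are borrowed from the question's expected output, the data frame is a shortened subset, and each column gets a single fallback, so it does not distinguish No_id from P1nM1 the way the expected output does.

```r
library(dplyr)
library(stringr)

# Shortened version of the question's data (illustrative subset)
DF <- data.frame(gene_Id = c("No_id", "P1_1_EXN", "P2_2_EXN_O", "P1nM1"))

res <- DF %>%
  mutate(
    # coalesce() keeps the extracted value and substitutes the label where it is NA
    Gene     = coalesce(str_c("gene", str_extract(gene_Id, "_\\d(?=_)")), "NEG"),
    Rest     = coalesce(str_extract(gene_Id, "(?<=_\\d_).*"), "Neg"),
    P_Number = coalesce(str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_"), "No_P")
  )

res
```

If different fallbacks per kind of ID are needed, case_when with several conditions is the more flexible tool.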
