Create new columns based on the content of strings of another column

Question

I have the following data:

gene_Id <- c( 'No_id' , 'P1_1_EXN' , 'P1_2_EXN' , 
              'P1_1_EXN_O' , 'P1_2_EXN_O' ,
              'P2_1_EXN' , 'P2_2_EXN' , 
              'P2_1_EXN_O' , 'P2_2_EXN_O' , 
             'P1nM1'  , 'P2nM1')

Count_F <- c(rep('KL',5),rep('KD',6))

DF <- data.frame(gene_Id , Count_F)

I would like to create three additional columns:

first_one should replace cells that contain the pattern '_Number_' with 'gene_Number'; for example, P1_1_EXN becomes gene_1. I would also like to control the value assigned to strings that don't match this pattern.

second_one should hold the rest of the string after the pattern '_Number_'; for example, EXN in the previous example.

third_one should replace any cell that contains 'P' followed by a number with 'PREP_Number'; for example, P1_1_EXN becomes PREP_1.

EDIT: this is the expected output.

PRER <- c ( 'No_P' ,rep('PREP_1' , 4) , rep('PREP_2' , 4) , 'PREP_1' , 'PREP_2')

Gene_Num <- c ('No_num' , 'gene_1' , 'gene_2' , 'gene_1' , 'gene_2' ,'gene_1',
               'gene_2', 'gene_1', 'gene_2'  , 'NEG' , 'NEG')

Rest <-c('No_rest','EXN','EXN','EXN_O','EXN_O','EXN','EXN','EXN_O','EXN_O', 'Neg','Neg')


New_DF <- cbind(DF,Gene_Num,Rest,PRER)

Thanks a lot in advance.

@MartinGal this could be replaced with anything, because it doesn't contain the pattern '_Number_'.
– Sam_9090, commented Jun 24, 2020 at 12:49

Cettt · Accepted Answer · 2020-06-24 12:57:56Z

Here is one possibility using the dplyr package and case_when.

library(dplyr)

DF %>%
  # col1: "gene_" plus the digit found between two underscores
  mutate(col1 = case_when(grepl("_\\d_", gene_Id) ~ gsub(".*_(\\d)_.*", "gene_\\1", gene_Id),
                          TRUE ~ "dummy1"),
         # col2: everything after the "_digit_" pattern
         col2 = case_when(grepl("_\\d_", gene_Id) ~ gsub("^.*_\\d_", "", gene_Id),
                          TRUE ~ "dummy2"),
         # col3: "PREP_" plus the digit that follows "P"
         col3 = case_when(grepl("P\\d", gene_Id) ~ gsub(".*P(\\d).*", "PREP_\\1", gene_Id),
                          TRUE ~ "dummy3"))

      gene_Id Count_F   col1   col2   col3
1       No_id      KL dummy1 dummy2 dummy3
2    P1_1_EXN      KL gene_1    EXN PREP_1
3    P1_2_EXN      KL gene_2    EXN PREP_1
4  P1_1_EXN_O      KL gene_1  EXN_O PREP_1
5  P1_2_EXN_O      KL gene_2  EXN_O PREP_1
6    P2_1_EXN      KD gene_1    EXN PREP_2
7    P2_2_EXN      KD gene_2    EXN PREP_2
8  P2_1_EXN_O      KD gene_1  EXN_O PREP_2
9  P2_2_EXN_O      KD gene_2  EXN_O PREP_2
10      P1nM1      KD dummy1 dummy2 PREP_1
11      P2nM1      KD dummy1 dummy2 PREP_2

Here is a short explanation: first I check with grepl whether the desired substring is contained in gene_Id. If it is, I extract it according to the rules. If not, I assign a dummy value (I named those dummy1, dummy2 and dummy3).

I use regular expressions to match the strings: \\d matches a digit and _\\d_ matches a digit between two underscores. In the replacement argument of gsub, \\1 refers to whatever was matched by the first pair of parentheses; in this case it is always a digit.

So, for example, the definition of col1 works like this:

Check whether the pattern _\\d_ occurs in gene_Id: if it does, replace the whole string with gene_\\1, where \\1 is the digit between the underscores.
If the pattern _\\d_ does not occur, assign "dummy1".
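
The two steps for col1 can also be tried in isolation with base R only. The following is a minimal sketch that uses a few values from the question's gene_Id column; the variable names ids, has_num and gene are made up for illustration, and ifelse stands in for case_when here.

```r
# A few values from the question's gene_Id column
ids <- c("No_id", "P1_1_EXN", "P2_2_EXN_O", "P1nM1")

# Step 1: TRUE where a single digit sits between two underscores
has_num <- grepl("_\\d_", ids)

# Step 2: where the pattern exists, replace the whole string with
# "gene_" plus the captured digit (\\1); otherwise use a placeholder
gene <- ifelse(has_num, gsub(".*_(\\d)_.*", "gene_\\1", ids), "dummy1")

print(has_num)
print(gene)
```

Because the pattern in gsub starts and ends with .*, it consumes the entire string, so only the replacement text remains.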

edited Jun 24, 2020 at 12:57

answered Jun 24, 2020 at 12:50


2 Comments

Sam_9090 Over a year ago

Would it be possible to use different dummy values in col2? For example, the string No_id could get dummy4, P1nM1 dummy5, and P1nM1 dummy6.

Cettt Over a year ago

Yes, case_when makes this possible: simply add a second logical expression to distinguish between the dummy values. See ?dplyr::case_when for more info.
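
As a rough sketch of that suggestion, extra conditions can be placed before the catch-all TRUE branch. The mapping used here (No_id to dummy4, strings starting with P1 to dummy5, everything else to dummy6) is an assumption about what the comment asks for, and the small DF is a cut-down stand-in for the question's data.

```r
library(dplyr)

# Cut-down stand-in for the question's data frame
DF <- data.frame(gene_Id = c("No_id", "P1_1_EXN", "P1nM1", "P2nM1"))

DF <- DF %>%
  mutate(col2 = case_when(
    # rows with the "_digit_" pattern keep the extracted remainder
    grepl("_\\d_", gene_Id) ~ gsub("^.*_\\d_", "", gene_Id),
    # assumed mapping of the remaining rows to separate dummy values
    gene_Id == "No_id"      ~ "dummy4",
    grepl("^P1", gene_Id)   ~ "dummy5",
    TRUE                    ~ "dummy6"
  ))

print(DF)
```

case_when uses the first condition that is TRUE for each row, so the more specific tests have to come before the final TRUE.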

Martin Gal · Accepted Answer · 2020-06-24 12:58:42Z

An alternative using dplyr and stringr:

library(dplyr)
library(stringr)

DF %>%
  mutate(Gene     = str_c("gene", str_extract(gene_Id, "_\\d(?=_)")),
         Rest     = str_extract(gene_Id, "(?<=_\\d_).*"),
         P_Number = str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_"))

returns

      gene_Id Count_F   Gene  Rest P_Number
1       No_id      KL   <NA>  <NA>     <NA>
2    P1_1_EXN      KL gene_1   EXN   PREP_1
3    P1_2_EXN      KL gene_2   EXN   PREP_1
4  P1_1_EXN_O      KL gene_1 EXN_O   PREP_1
5  P1_2_EXN_O      KL gene_2 EXN_O   PREP_1
6    P2_1_EXN      KD gene_1   EXN   PREP_2
7    P2_2_EXN      KD gene_2   EXN   PREP_2
8  P2_1_EXN_O      KD gene_1 EXN_O   PREP_2
9  P2_2_EXN_O      KD gene_2 EXN_O   PREP_2
10      P1nM1      KD   <NA>  <NA>   PREP_1
11      P2nM1      KD   <NA>  <NA>   PREP_2

I didn't include any handling for the <NA> cases.
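
One possible way to fill those gaps is dplyr::coalesce, which replaces each NA with a fallback value. This is only a sketch: the placeholder strings No_num, No_rest and No_P are borrowed from the question's expected output and applied to every non-matching row, which is an assumption, and the small DF is a cut-down stand-in for the real data.

```r
library(dplyr)
library(stringr)

# Cut-down stand-in for the question's data frame
DF <- data.frame(gene_Id = c("No_id", "P1_1_EXN", "P1nM1"))

DF <- DF %>%
  mutate(
    # coalesce() keeps the extracted value and falls back to the
    # placeholder wherever the extraction returned NA
    Gene     = coalesce(str_c("gene", str_extract(gene_Id, "_\\d(?=_)")), "No_num"),
    Rest     = coalesce(str_extract(gene_Id, "(?<=_\\d_).*"), "No_rest"),
    P_Number = coalesce(str_replace(str_extract(gene_Id, "P\\d"), "P", "PREP_"), "No_P")
  )

print(DF)
```

If different rows need different placeholders, case_when on is.na() of the extracted value may be a better fit than a single fallback.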
