4

I am trying to order a character vector of file names in R. Each name contains three substrings that I want to sort on. The file names are formatted as follows:

MAF001.incMHC.zPGS.S1
MAF002.incMHC.zPGS.S1
MAF003.incMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF002.incMHC.zPGS.S2
MAF003.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF001.noMHC.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.noMHC.zPGS.S2

I want to order this list first on the MAF substring, then the MHC substring, then the S substring, so that the order is:

MAF001.incMHC.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S1
MAF001.noMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S2
MAF002.incMHC.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF002.incMHC.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.incMHC.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF003.incMHC.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF003.noMHC.zPGS.S2

I have had a play around with gsub after seeing the answer to this question regarding a single substring: R Sort strings according to substring

But I am not sure how to extend this idea to multiple substrings (of mixed character and numerical classes) within a string.

5 Answers

3

Here's a one-liner in base R:

bar <- foo[order(sapply(strsplit(foo, "\\."), function(x) paste(x[1], x[4])))]
head(data.frame(result = bar), 10)

                          result
1          MAF001.incMHC.zPGS.S1
2  MAF001.noMHC_incRS148.zPGS.S1
3           MAF001.noMHC.zPGS.S1
4          MAF001.incMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S2
6           MAF001.noMHC.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8  MAF002.noMHC_incRS148.zPGS.S1
9           MAF002.noMHC.zPGS.S1
10         MAF002.incMHC.zPGS.S2

Explanation:

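  • Split each string on . using strsplit: strsplit(foo, "\\.")
  • Extract and combine elements 1 and 4 (the MAF and S parts): paste(x[1], x[4])
  • Get the order of the combined keys using order; names that tie on MAF and S keep their original relative order, which is why the MHC variants come out in the order they have in foo
  • Index foo with that order to get the sorted values: foo[...]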
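
The one-liner relies on the MHC variants already appearing in the wanted order within foo. As a minimal sketch (not part of the original answer), the MHC order can instead be made an explicit third sort key. The names grid, parts, and mhc_levels are illustrative assumptions, and the input is rebuilt in reversed order to show that the result no longer depends on it.

```r
# Rebuild the 18 file names from the question, deliberately reversed so the
# result cannot depend on the input order.
grid <- expand.grid(maf = sprintf("MAF%03d", 1:3),
                    mhc = c("incMHC", "noMHC_incRS148", "noMHC"),
                    s   = c("S1", "S2"),
                    stringsAsFactors = FALSE)
foo <- rev(paste(grid$maf, grid$mhc, "zPGS", grid$s, sep = "."))

# One row per file name, one column per dot-separated field.
parts <- do.call(rbind, strsplit(foo, ".", fixed = TRUE))

# Assumed custom order for the MHC field, taken from the desired output.
mhc_levels <- c("incMHC", "noMHC_incRS148", "noMHC")

# Sort on MAF, then S, then the position of the MHC field in mhc_levels.
bar <- foo[order(parts[, 1], parts[, 4], match(parts[, 2], mhc_levels))]
head(bar, 6)
```

Using match() against a vector of levels is one way to impose a non-alphabetical order; a factor with the same levels would work equally well.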

Data (foo):

c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1", 
"MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2", 
"MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", 
"MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2", 
"MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", 
"MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
"MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)

2 Comments

This works perfectly, thank you. And thank you for the explanation. I am presuming this is automatically ordering by the MHC substring as R already has this substring in the order I want it in?
strsplit(foo, '.', fixed = TRUE) would work as well.
2

Using tidyr and dplyr:

library(tidyr)
library(dplyr)

df <- data.frame(filenames = c(...))

pattern = "^([^.]+)\\.([^.]+)"
df %>%
  extract(filenames, 
          into = c("maf", "mhc"), 
          regex = pattern, remove = FALSE) %>%
  arrange(maf, mhc) %>%
  select(filenames)

Which yields

                       filenames
1          MAF001.incMHC.zPGS.S1
2          MAF001.incMHC.zPGS.S2
3           MAF001.noMHC.zPGS.S1
4           MAF001.noMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S1
6  MAF001.noMHC_incRS148.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8          MAF002.incMHC.zPGS.S2
9           MAF002.noMHC.zPGS.S1
10          MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13         MAF003.incMHC.zPGS.S1
14         MAF003.incMHC.zPGS.S2
15          MAF003.noMHC.zPGS.S1
16          MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2
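
For readers without tidyr, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is an illustration rather than the answer's code: the pattern is extended with a third group for the S field, and the four sample names are a small subset chosen for brevity.

```r
filenames <- c("MAF002.noMHC.zPGS.S2", "MAF001.noMHC_incRS148.zPGS.S1",
               "MAF001.incMHC.zPGS.S2", "MAF001.incMHC.zPGS.S1")

# Three capture groups: the MAF field, the MHC field, and the trailing S field.
pattern <- "^([^.]+)\\.([^.]+)\\.[^.]+\\.([^.]+)$"

# regexec() returns the full match plus each group; drop the full match.
m <- regmatches(filenames, regexec(pattern, filenames))
fields <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(fields) <- c("maf", "mhc", "s")

# Sort on all three extracted fields.
sorted <- filenames[order(fields[, "maf"], fields[, "mhc"], fields[, "s"])]
sorted
```

Each row of fields corresponds to one file name, so any subset or ordering of the columns can be passed to order().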

1

This result matches your desired output, but it sorts only on MAF and S. I didn't understand how the MHC string should be used for sorting; please elaborate on that part if this answer doesn't meet your needs.

library(stringr)
maf <- str_extract(filenames, "MAF\\d+\\.")
mhc <- str_extract(filenames, "\\..*MHC.*\\.")
s <- str_extract(filenames, "S\\d+$")

library(magrittr)
library(dplyr)

data.frame(filenames, maf, mhc, s) %>% 
  arrange(maf, s) %>% 
  select(filenames)

the output is:

                       filenames
1          MAF001.incMHC.zPGS.S1
2  MAF001.noMHC_incRS148.zPGS.S1
3           MAF001.noMHC.zPGS.S1
4          MAF001.incMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S2
6           MAF001.noMHC.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8  MAF002.noMHC_incRS148.zPGS.S1
9           MAF002.noMHC.zPGS.S1
10         MAF002.incMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S2
12          MAF002.noMHC.zPGS.S2
13         MAF003.incMHC.zPGS.S1
14 MAF003.noMHC_incRS148.zPGS.S1
15          MAF003.noMHC.zPGS.S1
16         MAF003.incMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S2
18          MAF003.noMHC.zPGS.S2

where filenames is

filenames <- read.table(text="MAF001.incMHC.zPGS.S1
MAF002.incMHC.zPGS.S1
MAF003.incMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF002.incMHC.zPGS.S2
MAF003.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF001.noMHC.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.noMHC.zPGS.S2", header=FALSE, stringsAsFactors=FALSE)$V1

0

Many good solutions have already been added here. I'm adding another one that uses only vectors.

Note: OP intended to sort on the MAF, MHC and S substrings. I have stuck with that rule and sorted on all three. Hence the result of my answer may not match the other answers.

The approach:

  1. Use regmatches to find substrings per description in OP
  2. Use paste to prepare strings based on which sort can be performed
  3. Set names of vector using setNames
  4. Sort vector on name.

    v[order(names(setNames(v, 
          paste(regmatches(v, regexpr("^MAF\\d+", v, perl = TRUE)),
                regmatches(v, regexpr("\\w*MHC\\w*", v, perl = TRUE)),
                regmatches(v, regexpr("\\w+\\d+$", v, perl = TRUE))
               ))))]
    #Result
    [1] "MAF001.incMHC.zPGS.S1"
    [2] "MAF001.incMHC.zPGS.S2"
    [3] "MAF001.noMHC.zPGS.S1"
    [4] "MAF001.noMHC.zPGS.S2"
    [5] "MAF001.noMHC_incRS148.zPGS.S1"
    [6] "MAF001.noMHC_incRS148.zPGS.S2"
    [7] "MAF002.incMHC.zPGS.S1"
    [8] "MAF002.incMHC.zPGS.S2"
    [9] "MAF002.noMHC.zPGS.S1"
    [10] "MAF002.noMHC.zPGS.S2"
    [11] "MAF002.noMHC_incRS148.zPGS.S1"
    [12] "MAF002.noMHC_incRS148.zPGS.S2"
    [13] "MAF003.incMHC.zPGS.S1"
    [14] "MAF003.incMHC.zPGS.S2"
    [15] "MAF003.noMHC.zPGS.S1"
    [16] "MAF003.noMHC.zPGS.S2"
    [17] "MAF003.noMHC_incRS148.zPGS.S1"
    [18] "MAF003.noMHC_incRS148.zPGS.S2"
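
The question mentions substrings of mixed character and numeric classes. With a pasted text key, a name ending in S10 would sort before one ending in S2. A minimal sketch of a variant that converts the numeric parts before ordering is given below; the sample names, including the S10 suffix, are made up for illustration and do not appear in the question.

```r
# Hypothetical names: S10 is invented to show the numeric comparison.
v <- c("MAF010.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S10",
       "MAF002.noMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2")

# Extract each key separately; convert the digit runs to integers.
maf <- as.integer(sub("^MAF", "", regmatches(v, regexpr("^MAF\\d+", v, perl = TRUE))))
mhc <- regmatches(v, regexpr("\\w*MHC\\w*", v, perl = TRUE))
s   <- as.integer(regmatches(v, regexpr("\\d+$", v, perl = TRUE)))

# order() accepts several keys of different types and uses them in turn.
res <- v[order(maf, mhc, s)]
res
```

Passing the keys to order() separately, rather than pasting them into one string, keeps integer keys numeric so that 2 sorts before 10.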
    

data

v <- c("MAF001.incMHC.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S1", "MAF001.noMHC.zPGS.S1", 
       "MAF001.incMHC.zPGS.S2", "MAF001.noMHC_incRS148.zPGS.S2", "MAF001.noMHC.zPGS.S2", 
       "MAF002.incMHC.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", "MAF002.noMHC.zPGS.S1", 
       "MAF002.incMHC.zPGS.S2", "MAF002.noMHC_incRS148.zPGS.S2", "MAF002.noMHC.zPGS.S2", 
       "MAF003.incMHC.zPGS.S1", "MAF003.noMHC_incRS148.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
       "MAF003.incMHC.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)

0

I have a function designed especially for such a task:

function

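library(magrittr)    # %>% and %<>%
library(purrr)       # map, map_chr, map_lgl
library(stringr)     # str_extract
library(data.table)  # as.data.table, .SD

reg_sort <- function(x,...,verbose=F) {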
    ellipsis <-   sapply(as.list(substitute(list(...)))[-1], deparse, simplify="array")
    reg_list <-   paste0(ellipsis, collapse=',')
    reg_list %<>% strsplit(",") %>% unlist %>% gsub("\\\\","\\",.,fixed=T)
    pattern  <-   reg_list %>% map_chr(~sub("^-\\\"","",.) %>% sub("\\\"$","",.) %>% sub("^\\\"","",.) %>% trimws)
    descInd  <-   reg_list %>% map_lgl(~grepl("^-\\\"",.)%>%as.logical)

    reg_extr <-   pattern %>% map(~str_extract(x,.)) %>% c(.,list(x)) %>% as.data.table
    reg_extr[] %<>% lapply(., function(x) type.convert(as.character(x), as.is = TRUE))

    map(rev(seq_along(pattern)),~{reg_extr<<-reg_extr[order(reg_extr[[.]],decreasing = descInd[.])]})

    if(verbose) { tmp<-lapply(reg_extr[,.SD,.SDcols=seq_along(pattern)],unique);names(tmp)<-pattern;tmp %>% print }

    return(reg_extr[[ncol(reg_extr)]])
}

data:

vec <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1", 
  "MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2", 
  "MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", 
  "MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2", 
  "MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", 
  "MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
  "MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)

call:

reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$")

result: (a character vector)

1          MAF001.incMHC.zPGS.S1
2          MAF001.incMHC.zPGS.S2
3           MAF001.noMHC.zPGS.S1
4           MAF001.noMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S1
6  MAF001.noMHC_incRS148.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8          MAF002.incMHC.zPGS.S2
9           MAF002.noMHC.zPGS.S1
10          MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13         MAF003.incMHC.zPGS.S1
14         MAF003.incMHC.zPGS.S2
15          MAF003.noMHC.zPGS.S1
16          MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2

other features are:

  • Sort descending (add - in front of the pattern): reg_sort(x=vec, -"^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)",-"S\\d+$")

  • Verbose mode: reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$",verbose=T) (shows what each regex pattern extracted, so you can check the sort keys)
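
The core idea (extract one key per regex, convert numeric-looking keys, then order on all keys) can also be sketched in base R without extra packages. The helper below is hypothetical and is not the answer's reg_sort(): the name sort_by_patterns and its decreasing argument are assumptions made for this illustration, and descending order is requested per pattern through a logical vector rather than a leading minus sign.

```r
# Hypothetical helper: sort x by the first match of each regex in `patterns`.
sort_by_patterns <- function(x, patterns, decreasing = FALSE) {
  keys <- lapply(patterns, function(p) {
    m <- regexpr(p, x, perl = TRUE)
    hit <- as.vector(m) > 0                 # regexpr() gives -1 where nothing matched
    out <- rep(NA_character_, length(x))
    out[hit] <- regmatches(x, m)
    type.convert(out, as.is = TRUE)         # digit-only keys become integers
  })
  # One decreasing flag per key; the radix method accepts such a vector.
  decreasing <- rep_len(decreasing, length(keys))
  x[do.call(order, c(keys, list(decreasing = decreasing, method = "radix")))]
}

vec <- c("MAF002.incMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2",
         "MAF001.incMHC.zPGS.S1", "MAF001.incMHC.zPGS.S2")
pats <- c("^MAF\\d+", "\\w*MHC\\w*", "\\d+$")

asc  <- sort_by_patterns(vec, pats)
desc <- sort_by_patterns(vec, pats, decreasing = c(TRUE, FALSE, FALSE))
```

Here asc sorts ascending on all three keys, while desc reverses only the MAF key. Note that method = "radix" compares character keys in the C locale, which may differ from locale-aware sorting for mixed-case or punctuated keys.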
