2

I have a matrix with multiple individuals in rows and multiple nucleotides (values) in columns. It looks like this:

     [,1][,2][,3][,4] ...
ind1   a   c   a   a
ind2   a   c   t   t
ind3   a   g   g   c
ind4   a   g   g   g
.
.
.

Now I would like to ignore all columns where only one value occurs (as in the example above the first column) and convert every column with two, three and four (no more than 4 is possible!) different nucleotides (values) into binary format. In the end it should look like this:

     [,1] [,2]  [,3] ...
ind1  10   100   1000
ind2  10   010   0100
ind3  01   001   0010
ind4  01   001   0001
.
.
.

For me it is only important to get the same binary code for if there are two, three or four different values. I was already calculating how many different values in each column occur, but I am not sure how to change the values to binary format:

df <- apply(df, 2, function(x) length(unique(x)))

Can someone help me?

2
  • Not clear how you are getting the expected output of '01' '10' where the column values are just 'a' for the first column Commented Oct 20, 2020 at 23:39
  • Not clear about the expected output. May be library(pryr);apply(df[-1], 2, function(x) {n <- length(unique(x)); substr(pryr::bits(x), n, n + n-1)}) Commented Oct 20, 2020 at 23:58

2 Answers 2

2

Here is something else to try. A custom function will take each column through apply. First, you can create a vector of numeric values corresponding to unique characters in the column (unique is used, as factor will otherwise alphabetize the order). A string of zeroes the length of the maximum number will be made, and the character position corresponding to each value will then be substituted with "1".

my_fun <- function(x) {
  vec <- as.numeric(factor(x, levels = unique(x)))
  vec_max <- max(na.omit(vec))
  lapply(vec, 
         function(y) ifelse(!is.na(y), 
                            sub(paste0("(.{", y - 1, "})."), 
                                "\\11", 
                                paste0(rep("0", vec_max), collapse = "")), 
                            NA))
}

m[] <- matrix(unlist(apply(m, 2, my_fun)))

Output

     [,1] [,2] [,3]  [,4]  
ind1 "1"  "10" "100" "1000"
ind2 "1"  "10" "010" "0100"
ind3 "1"  "01" "001" "0010"
ind4 "1"  "01" "001" "0001"

Data

m <- structure(c("a", "a", "a", "a", "c", "c", "g", "g", "a", "t", 
"g", "g", "a", "t", "c", "g"), .Dim = c(4L, 4L), .Dimnames = list(
    c("ind1", "ind2", "ind3", "ind4"), NULL))
Sign up to request clarification or add additional context in comments.

5 Comments

thanks a lot for helping me, but I get this error: Error in rep("0", max(vec)) : invalid 'times' argument
Hi, I tried with my own data. It is too big for dput(head(df)), it has 543 rows and ~11000 columns, maybe that's the reason?
Yes I do have, but I was using [!is.na(x)].Wouldn't this work? And when pasting HAs in a small matrix and running your command it works.
I would like to ignore the NAs. Is this possible?
@LukasMe See edited answer - let me know if this helps. If a value is missing, it will use NA in result. Also, the length of the binary result will be shorter too. I need to run but will check in again later in a few hours.
1

Here is my attempt:

r1 <- c("a","c","a","a")
r2 <- c("a","c","t","t")
r3 <- c("a","g","g","c")
r4 <- c("a","g","g","g")

n.mat <- rbind(r1,r2,r3,r4)

number_to_nucleotide_binary <- function(x,len) {
  out <- rep("0",len)
  out[x] <- "1"
  return(paste(out,collapse = ""))
}

nuc_to_binary <- function(x) {
  
  len <- length(unique(x))
  char <- sort(unique(x))
  out <- x
  
  if(len != 1) {
    pos <- match(x,char)
    out <- sapply(X = pos,FUN = function(x) {number_to_nucleotide_binary(x = x,len = len)})
  }
  
  return(out)
}

apply(X = n.mat,FUN = nuc_to_binary,MARGIN = 2)

Input:

   [,1] [,2] [,3] [,4]
r1 "a"  "c"  "a"  "a" 
r2 "a"  "c"  "t"  "t" 
r3 "a"  "g"  "g"  "c" 
r4 "a"  "g"  "g"  "g" 

Output:

     [,1] [,2] [,3]  [,4]  
[1,] "a"  "10" "100" "1000"
[2,] "a"  "10" "001" "0001"
[3,] "a"  "01" "010" "0100"
[4,] "a"  "01" "010" "0010"

2 Comments

Hi, thanks a lot for helping me, but I would like to have only 1000 , 0100 , 0010 , 0001 but no 0011for example, but besides that your command worked quite nice!
I've edited the code to be simpler and remove dependence on the other SO answer. Hopefully this is what you were after - although Ben's answer is more scalable / elegant / concise.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.