7

I have a matrix (1000 x 2830) like this:

        9178    3574    3547
160     B_B     B_B      A_A
301     B_B     A_B      A_B
303     B_B     B_B      A_A
311     A_B     A_B      A_A
312     B_B     A_B      A_A
314     B_B     A_B      A_A

and I want to obtain the following (duplicating colnames and splitting each element of each column):

      9178   9178   3574   3574   3547   3547
160     B      B      B      B      A      A
301     B      B      A      B      A      B
303     B      B      B      B      A      A
311     A      B      A      B      A      A
312     B      B      A      B      A      A
314     B      B      A      B      A      A

I tried using strsplit but I got error messages because this is a matrix, not a string. Could you please provide some ideas for resolving this?

  • None of it looks like a matrix at the moment. Could you try formatting it a little differently? Commented Feb 19, 2015 at 13:21
  • Consider using a prefix or suffix on the column names, like 9178_1 and 9178_2, to avoid duplicated column names (which make it much more difficult to select the right columns later). Commented Feb 19, 2015 at 13:30

4 Answers

7

Here's an option using dplyr (for bind_cols) and tidyr (for separate_) together with lapply from base R. It assumes that your data is a data.frame (i.e. you might need to convert it to data.frame first):

library(dplyr)
library(tidyr)

lapply(names(df), function(x) separate_(df[x], x, paste0(x,"_",1:2), sep = "_" )) %>% 
  bind_cols
#  X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
#1       B       B       B       B       A       A
#2       B       B       A       B       A       B
#3       B       B       B       B       A       A
#4       A       B       A       B       A       A
#5       B       B       A       B       A       A
#6       B       B       A       B       A       A

4 Comments

Seems to miss out on the rownames though--need a slight modification.
@July, are you interested in the original row names? Do they carry any information you require?
Yes, rownames are important but I recovered them from matrix used as input for lapply function. Thanks for your interest
If you need the row names, you can just add the following to the existing code: ... %>% mutate(row = rownames(df))
6

I'm biased, but I would recommend using cSplit from my "splitstackshape" package. Since it appears that you have rownames in your input, use as.data.table(., keep.rownames = TRUE):

library(splitstackshape)
cSplit(as.data.table(mydf, keep.rownames = TRUE), names(mydf), "_")
#     rn X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
# 1: 160       B       B       B       B       A       A
# 2: 301       B       B       A       B       A       B
# 3: 303       B       B       B       B       A       A
# 4: 311       A       B       A       B       A       A
# 5: 312       B       B       A       B       A       A
# 6: 314       B       B       A       B       A       A

Less legible than cSplit (but presently likely to be faster) would be to use stri_split_fixed from "stringi", like this:

library(stringi)
`dimnames<-`(do.call(cbind, 
                     lapply(mydf, stri_split_fixed, "_", simplify = TRUE)), 
             list(rownames(mydf), rep(colnames(mydf), each = 2)))
#     X9178 X9178 X3574 X3574 X3547 X3547
# 160 "B"   "B"   "B"   "B"   "A"   "A"  
# 301 "B"   "B"   "A"   "B"   "A"   "B"  
# 303 "B"   "B"   "B"   "B"   "A"   "A"  
# 311 "A"   "B"   "A"   "B"   "A"   "A"  
# 312 "B"   "B"   "A"   "B"   "A"   "A"  
# 314 "B"   "B"   "A"   "B"   "A"   "A" 

If speed is of the essence, I would suggest checking out the "iotools" package, particularly the mstrsplit function. The approach would be similar to the "stringi" approach:

library(iotools)
`dimnames<-`(do.call(cbind, 
                lapply(mydf, mstrsplit, "_", ncol = 2, type = "character")),
             list(rownames(mydf), rep(colnames(mydf), each = 2)))

You may need to add an lapply(mydf, as.character) in there if you forgot to use stringsAsFactors = FALSE when converting from a matrix to a data.frame, but it should still beat even the stri_split approach.
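As a minimal base R sketch of that conversion step: the matrix below is rebuilt from the values in the question, and `stringsAsFactors = TRUE` is passed on purpose only to reproduce the "forgot to use stringsAsFactors = FALSE" situation. The names `m`, `mydf` and `col_classes` are illustrative.

```r
# Rebuild the example matrix from the question (filled column by column).
m <- matrix(c("B_B", "B_B", "B_B", "A_B", "B_B", "B_B",
              "B_B", "A_B", "B_B", "A_B", "A_B", "A_B",
              "A_A", "A_B", "A_A", "A_A", "A_A", "A_A"),
            nrow = 6,
            dimnames = list(c("160", "301", "303", "311", "312", "314"),
                            c("9178", "3574", "3547")))

# Deliberately create factor columns to mimic the problematic conversion.
mydf <- as.data.frame(m, stringsAsFactors = TRUE)

# Convert every column back to character; assigning into mydf[] keeps the
# data.frame structure, including row names and column names.
mydf[] <- lapply(mydf, as.character)

col_classes <- sapply(mydf, class)
print(col_classes)
```

Assigning into `mydf[]` rather than `mydf` is a design choice here: a bare `lapply()` result would be a plain list and the row names would be lost.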

1 Comment

The second solution is remarkably fast (see the benchmark below)
4

Something you can do, although it seems a bit "twisted" (yourmat being your matrix)...:

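# Split every cell; the row names encode the source column plus the row index
inter <- data.frame(t(sapply(as.vector(yourmat), function(x) strsplit(x, "_")[[1]])),
                    row.names = paste0(rep(colnames(yourmat), each = nrow(yourmat)),
                                       seq_len(nrow(yourmat))),
                    stringsAsFactors = FALSE)
# Regroup by source column (substr(..., 1, 4) assumes 4-character column names)
res <- do.call("cbind",
               split(inter, factor(substr(row.names(inter), 1, 4), levels = colnames(yourmat))))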
res
#       9178.X1 9178.X2 3574.X1 3574.X2 3547.X1 3547.X2
# 91781       B       B       B       B       A       A
# 91782       B       B       A       B       A       B
# 91783       B       B       B       B       A       A
# 91784       A       B       A       B       A       A
# 91785       B       B       A       B       A       A
# 91786       B       B       A       B       A       A

Edit
If you want the row.names of res to be the same as in yourmat, you can do:

row.names(res)<-row.names(yourmat)

NB: If yourmat is a data.frame instead of a matrix the as.vector function in the first line needs to be changed to unlist.
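A small sketch of why that change is needed, using a hypothetical 2 x 2 cut-down of the question's data (`yourdf`, `flat_mat`, `flat_df` and `not_flat` are illustrative names). On a data.frame, `as.vector` returns a list with one element per column rather than one string per cell, which is one plausible cause of a "row names supplied are of the wrong length" error.

```r
# Hypothetical 2 x 2 subset of the question's matrix.
yourmat <- matrix(c("B_B", "A_B", "A_A", "A_B"), nrow = 2,
                  dimnames = list(c("160", "301"), c("9178", "3574")))
yourdf <- as.data.frame(yourmat, stringsAsFactors = FALSE)

# On a matrix, as.vector() yields one string per cell, column by column.
flat_mat <- as.vector(yourmat)

# On a data.frame, as.vector() still returns a list of columns ...
not_flat <- as.vector(yourdf)

# ... whereas unlist() flattens it to one string per cell.
flat_df <- unlist(yourdf, use.names = FALSE)

print(length(not_flat))  # number of columns, not number of cells
print(flat_df)
```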

8 Comments

Thanks a lot, I also tried this option and, although it is a bit more complicated than docendo's answer, it worked. Thanks again!
@AnandaMahto, you mean for the inter data.frame? It's so I can later split the data.frame according to the former columns (as it is a matrix, I lose the names when using as.vector).
@CathG, No. I mean in your results. Where did "91781", "91782" and so on come from, when the original rownames were "160", "301" and so on?
@AnandaMahto, they come from the call to cbind and correspond to the row.names of the first element of the list I get from the call to split. And to be honest, I wouldn't have guessed what the row.names would be. But I can change them afterwards to keep the former names.
It didn't work for me: Error in data.frame(t(sapply(as.vector(yourmat), function(x) { : row names supplied are of the wrong length
2

base R solution without using data frames:

# split
z <- unlist(strsplit(m,'_'))
M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))

# properly order columns
i <- 1:ncol(M)
M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]

# set dimnames
rownames(M) <- rownames(m)
colnames(M) <- rep(colnames(m),each=2)

#     9178  9178  3574  3574  3547  3547
# 160 "B"   "B"   "B"   "B"   "A"   "A"  
# 301 "B"   "B"   "A"   "B"   "A"   "B"  
# 303 "B"   "B"   "B"   "B"   "A"   "A"  
# 311 "A"   "B"   "A"   "B"   "A"   "A"  
# 312 "B"   "B"   "A"   "B"   "A"   "A"  
# 314 "B"   "B"   "A"   "B"   "A"   "A"  
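The column-reordering step is the least obvious part, so here is a hedged walk-through on the question's six rows (the matrix is rebuilt from the values in the question; `first`, `second` and `idx` are illustrative names, not part of the answer's code). After the reshape, the columns hold all first halves followed by all second halves, and `order()` produces the index that interleaves them.

```r
# Rebuild the example matrix from the question (filled column by column).
m <- matrix(c("B_B", "B_B", "B_B", "A_B", "B_B", "B_B",
              "B_B", "A_B", "B_B", "A_B", "A_B", "A_B",
              "A_A", "A_B", "A_A", "A_A", "A_A", "A_A"),
            nrow = 6,
            dimnames = list(c("160", "301", "303", "311", "312", "314"),
                            c("9178", "3574", "3547")))

# strsplit() accepts a character matrix and walks it column by column.
z <- unlist(strsplit(m, "_"))
first  <- z[c(TRUE, FALSE)]   # part before "_" for every cell
second <- z[c(FALSE, TRUE)]   # part after "_" for every cell

# Columns are now: first halves of cols 1..3, then second halves of cols 1..3.
M <- matrix(c(first, second), nrow = nrow(m))

# order() of (odd positions, even positions) gives the interleaving index.
i <- seq_len(ncol(M))
idx <- order(c(i[c(TRUE, FALSE)], i[c(FALSE, TRUE)]))
print(idx)

M <- M[, idx]
dimnames(M) <- list(rownames(m), rep(colnames(m), each = 2))
print(M)
```

To avoid duplicated column names, as suggested in a comment on the question, `paste0(rep(colnames(m), each = 2), "_", 1:2)` could be used for the column names instead.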

[Update] Here is a small benchmarking study of the proposed solutions (I didn't include the cSplit solution because it was too slow):

Setup:

m <- matrix('A_B',nrow=1000,ncol=2830)
d <- as.data.frame(m, stringsAsFactors = FALSE)

##### 
f.mtrx <- function(m) {
  z <- unlist(strsplit(m,'_'))
  M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))

  # properly order columns
  i <- 1:ncol(M)
  M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]

  # set dimnames
  rownames(M) <- rownames(m)
  colnames(M) <- rep(colnames(m),each=2)
  M
}

library(stringi)
f.mtrx2 <- function(m) {
  z <- unlist(stri_split_fixed(m,'_'))
  M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))

  # properly order columns
  i <- 1:ncol(M)
  M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]

  # set dimnames
  rownames(M) <- rownames(m)
  colnames(M) <- rep(colnames(m),each=2)
  M
}

#####
library(splitstackshape)
f.cSplit <- function(mydf) cSplit(as.data.table(mydf, keep.rownames = TRUE), names(mydf), "_")

#####
library(stringi)
f.stringi <- function(mydf) `dimnames<-`(do.call(cbind, 
                     lapply(mydf, stri_split_fixed, "_", simplify = TRUE)), 
             list(rownames(mydf), rep(colnames(mydf), each = 2)))

#####
library(dplyr)
library(tidyr)

f.dplyr <- function(df) lapply(names(df), function(x) separate_(df[x], x, paste0(x,"_",1:2), sep = "_" )) %>% 
  bind_cols

#####
library(iotools)
f.mstrsplit <- function(mydf) `dimnames<-`(do.call(cbind, 
                     lapply(mydf, mstrsplit, "_", ncol = 2, type = "character")),
             list(rownames(mydf), rep(colnames(mydf), each = 2)))



#####
library(rbenchmark)

benchmark(f.mtrx(m), f.mtrx2(m), f.dplyr(d), f.stringi(d), f.mstrsplit(d), replications = 10)

Results:

            test replications elapsed relative user.self sys.self user.child sys.child
3     f.dplyr(d)           10  27.722   10.162    27.360    0.269          0         0
5 f.mstrsplit(d)           10   2.728    1.000     2.607    0.098          0         0
1      f.mtrx(m)           10  37.943   13.909    34.885    0.799          0         0
2     f.mtrx2(m)           10  15.176    5.563    13.936    0.802          0         0
4   f.stringi(d)           10   8.107    2.972     7.815    0.247          0         0

In the updated benchmark, the winner is f.mstrsplit.

3 Comments

It seems to me that f.stringi is the fastest, so it is not clear how the data.table method wins here.
If you really want a winner, check out the update at the bottom of my answer.
@AnandaMahto, OP mentioned matrix of 1000x2830 size, so I guess performance is crucial here. Your latest solution, f.mstrsplit(), is a winner on large matrices.
