Split column string elements within a row inside a dataframe

Question

I have a matrix (1000 x 2830) like this:

        9178    3574    3547
160     B_B     B_B      A_A
301     B_B     A_B      A_B
303     B_B     B_B      A_A
311     A_B     A_B      A_A
312     B_B     A_B      A_A
314     B_B     A_B      A_A

and I want to obtain the following (duplicating colnames and splitting each element of each column):

      9178   9178   3574   3574   3547   3547
160     B      B      B      B      A      A
301     B      B      A      B      A      B
303     B      B      B      B      A      A
311     A      B      A      B      A      A
312     B      B      A      B      A      A
314     B      B      A      B      A      A

I tried using strsplit but I got error messages because this is a matrix, not a string. Could you please provide some ideas for resolving this?
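
A note on the error itself: strsplit only accepts character input. The sketch below assumes the sample is held in a character matrix (the name m and the factor f are chosen here for illustration, they are not from the question). Under that assumption strsplit does run on the matrix, but it returns a flat list without dimensions, while a factor column is rejected, which is one plausible source of the error:

```r
# Rebuild the sample from the question as a character matrix (the name `m` is illustrative)
m <- matrix(c("B_B", "B_B", "B_B", "A_B", "B_B", "B_B",
              "B_B", "A_B", "B_B", "A_B", "A_B", "A_B",
              "A_A", "A_B", "A_A", "A_A", "A_A", "A_A"),
            nrow = 6,
            dimnames = list(c("160", "301", "303", "311", "312", "314"),
                            c("9178", "3574", "3547")))

# strsplit() accepts a character matrix, but the result is a plain list:
# one element per cell, in column-major order, with the dimensions dropped
parts <- strsplit(m, "_")
length(parts)
parts[[4]]   # the cell in row "311", column "9178"

# A factor is not character data, so strsplit() signals an error here
f <- factor(m[, "9178"])
msg <- tryCatch(strsplit(f, "_"), error = function(e) conditionMessage(e))
msg
```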

None of it looks like a matrix at the moment. Could you try formatting it a little differently?
– LauriK, Commented Feb 19, 2015 at 13:21
Consider using a prefix or suffix on the column names, like 9178_1 and 9178_2, to avoid duplicated column names (which make it much more difficult to select the right columns later).
– talat, Commented Feb 19, 2015 at 13:30
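
The naming suggestion in the last comment can be sketched in base R as follows. The objects cn, dup, suffixed and x are made up for illustration. The sketch only shows how suffixed names could be generated and why duplicated names are awkward: indexing by a duplicated name picks the first match.

```r
cn <- c("9178", "3574", "3547")          # original column names

dup      <- rep(cn, each = 2)            # duplicated names, as in the desired output
suffixed <- paste0(dup, "_", 1:2)        # "9178_1" "9178_2" "3574_1" ...

# Toy matrix with duplicated column names: selecting by name
# returns only the first of the two "9178" columns
x <- matrix(1:12, nrow = 2, dimnames = list(NULL, dup))
first_only <- x[, "9178"]

anyDuplicated(suffixed)                  # 0, i.e. the suffixed names are unique
```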

talat · Accepted Answer · 2015-02-19 13:42:45Z

7

Here's an option using dplyr (for bind_cols) and tidyr (for separate_) together with lapply from base R. It assumes that your data is a data.frame (i.e. you might need to convert it to data.frame first):

library(dplyr)
library(tidyr)

lapply(names(df), function(x) separate_(df[x], x, paste0(x,"_",1:2), sep = "_" )) %>% 
  bind_cols
#  X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
#1       B       B       B       B       A       A
#2       B       B       A       B       A       B
#3       B       B       B       B       A       A
#4       A       B       A       B       A       A
#5       B       B       A       B       A       A
#6       B       B       A       B       A       A

answered Feb 19, 2015 at 13:42

4 Comments

A5C1D2H2I1M1N2O1R2T1 Over a year ago

Seems to miss out on the rownames though--need a slight modification.

talat Over a year ago

@July, are you interested in the original row names? Do they carry any information you require?

July Over a year ago

Yes, the row names are important, but I recovered them from the matrix used as input for the lapply function. Thanks for your interest.

talat Over a year ago

If you need the row names, you can just add the following to the existing code: ... %>% mutate(row = rownames(df))

A5C1D2H2I1M1N2O1R2T1 · Answer · 2015-02-19 17:25:10Z

I'm biased, but I would recommend using cSplit from my "splitstackshape" package. Since it appears that you have rownames in your input, use as.data.table(., keep.rownames = TRUE):

library(splitstackshape)
cSplit(as.data.table(mydf, keep.rownames = TRUE), names(mydf), "_")
#     rn X9178_1 X9178_2 X3574_1 X3574_2 X3547_1 X3547_2
# 1: 160       B       B       B       B       A       A
# 2: 301       B       B       A       B       A       B
# 3: 303       B       B       B       B       A       A
# 4: 311       A       B       A       B       A       A
# 5: 312       B       B       A       B       A       A
# 6: 314       B       B       A       B       A       A

Less legible than cSplit (but presently likely to be faster) would be to use stri_split_fixed from "stringi", like this:

library(stringi)
`dimnames<-`(do.call(cbind, 
                     lapply(mydf, stri_split_fixed, "_", simplify = TRUE)), 
             list(rownames(mydf), rep(colnames(mydf), each = 2)))
#     X9178 X9178 X3574 X3574 X3547 X3547
# 160 "B"   "B"   "B"   "B"   "A"   "A"  
# 301 "B"   "B"   "A"   "B"   "A"   "B"  
# 303 "B"   "B"   "B"   "B"   "A"   "A"  
# 311 "A"   "B"   "A"   "B"   "A"   "A"  
# 312 "B"   "B"   "A"   "B"   "A"   "A"  
# 314 "B"   "B"   "A"   "B"   "A"   "A"

If speed is of the essence, I would suggest checking out the "iotools" package, particularly the mstrsplit function. The approach would be similar to the "stringi" approach:

library(iotools)
`dimnames<-`(do.call(cbind, 
                lapply(mydf, mstrsplit, "_", ncol = 2, type = "character")),
             list(rownames(mydf), rep(colnames(mydf), each = 2)))

You may need to add an lapply(mydf, as.character) in there if you forgot to use stringsAsFactors = FALSE when converting from a matrix to a data.frame, but it should still beat even the stri_split approach.
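
A minimal sketch of that conversion step, assuming mydf ended up with factor columns (the two-row mydf below is toy data, not the real input):

```r
# Toy data.frame whose columns were (accidentally) created as factors
mydf <- data.frame(X9178 = c("B_B", "A_B"),
                   X3574 = c("A_B", "A_B"),
                   row.names = c("160", "311"),
                   stringsAsFactors = TRUE)
sapply(mydf, class)

# Assigning into mydf[] replaces the columns but keeps the data.frame
# structure, including its row names
mydf[] <- lapply(mydf, as.character)
sapply(mydf, class)
```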

The second solution is remarkably fast (see the benchmark below)

Cath · Answer · 2015-02-19 18:11:48Z

4

Something you can do, although it seems a bit "twisted" (yourmat being your matrix)...:

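# Split every cell on "_" (one row per cell), naming the rows <column name><row index>
inter <- data.frame(t(sapply(as.vector(yourmat), function(x) {
                                                 strsplit(x, "_")[[1]]
                                             })),
                   row.names = paste0(rep(colnames(yourmat), each = nrow(yourmat)), 1:nrow(yourmat)),
                   stringsAsFactors = FALSE)
# Regroup by original column; substr(..., 1, 4) assumes 4-character column names
res <- do.call("cbind",
              split(inter, factor(substr(row.names(inter), 1, 4), levels = colnames(yourmat))))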
res
#       9178.X1 9178.X2 3574.X1 3574.X2 3547.X1 3547.X2
# 91781       B       B       B       B       A       A
# 91782       B       B       A       B       A       B
# 91783       B       B       B       B       A       A
# 91784       A       B       A       B       A       A
# 91785       B       B       A       B       A       A
# 91786       B       B       A       B       A       A

Edit
If you want the row.names of res to be the same as in yourmat, you can do:

row.names(res)<-row.names(yourmat)

NB: If yourmat is a data.frame instead of a matrix the as.vector function in the first line needs to be changed to unlist.
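
To illustrate the NB: in the sketch below, yourmat is a toy 2 x 2 matrix and d is a data.frame built from it with character columns (both are made up for illustration). Under those assumptions, unlist on the data.frame yields the same column-by-column values that as.vector yields on the matrix:

```r
yourmat <- matrix(c("B_B", "A_B", "B_B", "A_B"), nrow = 2,
                  dimnames = list(c("160", "311"), c("9178", "3574")))
d <- as.data.frame(yourmat, stringsAsFactors = FALSE)

v_mat <- as.vector(yourmat)   # column-major character vector, no names
v_df  <- unlist(d)            # same values in the same order, but with names attached

identical(unname(v_df), v_mat)
```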

edited Feb 19, 2015 at 18:11

answered Feb 19, 2015 at 13:44

8 Comments

July Over a year ago

Thanks a lot, I also tried this option and, although it is a bit more complicated than docendo's answer, it worked. Thanks again!

Cath Over a year ago

@AnandaMahto, you mean for the inter data.frame? It's so I can later split the data.frame according to the former columns (as it is a matrix, I lose the names when using as.vector).

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@CathG, No. I mean in your results. Where did "91781", "91782" and so on come from when the original rownames were "160", "301" and so on?

Cath Over a year ago

@AnandaMahto, they come from the call to cbind and correspond to the row.names of the first element of the list I get from the call to split. And to be honest, I wouldn't have guessed what the row.names would be. But I can change them afterwards to keep the former names.

Marat Talipov Over a year ago

It didn't work for me: Error in data.frame(t(sapply(as.vector(yourmat), function(x) { : row names supplied are of the wrong length

Marat Talipov · Answer · 2015-02-19 17:42:32Z

base R solution without using data frames:

# split
z <- unlist(strsplit(m,'_'))
M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))

# properly order columns
i <- 1:ncol(M)
M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]

# set dimnames
rownames(M) <- rownames(m)
colnames(M) <- rep(colnames(m),each=2)

#     9178 9178 3574 3574 3547 3547
# 160 "B"  "B"  "B"  "B"  "A"  "A" 
# 301 "B"  "B"  "A"  "B"  "A"  "B" 
# 303 "B"  "B"  "B"  "B"  "A"  "A" 
# 311 "A"  "B"  "A"  "B"  "A"  "A" 
# 312 "B"  "B"  "A"  "B"  "A"  "A" 
# 314 "B"  "B"  "A"  "B"  "A"  "A" 
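
The "properly order columns" step is the least obvious part, so here is a small worked sketch of it for three original columns. The names blocks, idx and alt are illustrative. After the split, the matrix holds all first halves followed by all second halves, and the order() call computes the permutation that interleaves them:

```r
# Column layout right after the split: first halves, then second halves
blocks <- matrix(c("f1", "f2", "f3", "s1", "s2", "s3"), nrow = 1)

i   <- 1:6
idx <- order(c(i[c(TRUE, FALSE)], i[c(FALSE, TRUE)]))
idx
# [1] 1 4 2 5 3 6

# An equivalent way to build the same interleaving index
alt <- as.vector(rbind(1:3, 4:6))
identical(idx, alt)
# [1] TRUE

interleaved <- blocks[, idx]   # "f1" "s1" "f2" "s2" "f3" "s3"
```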

[Update] Here is a small benchmarking study of the proposed solutions (I didn't include the cSplit solution because it was too slow):

Setup:

m <- matrix('A_B',nrow=1000,ncol=2830)
d <- as.data.frame(m, stringsAsFactors = FALSE)

##### 
f.mtrx <- function(m) {
  z <- unlist(strsplit(m,'_'))
  M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))

  # properly order columns
  i <- 1:ncol(M)
  M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]

  # set dimnames
  rownames(M) <- rownames(m)
  colnames(M) <- rep(colnames(m),each=2)
  M
}

library(stringi)
f.mtrx2 <- function(m) {
  z <- unlist(stri_split_fixed(m,'_'))
  M <- matrix(c(z[c(T,F)],z[c(F,T)]),nrow=nrow(m))

  # properly order columns
  i <- 1:ncol(M)
  M <- M[,order(c(i[c(T,F)],i[c(F,T)]))]

  # set dimnames
  rownames(M) <- rownames(m)
  colnames(M) <- rep(colnames(m),each=2)
  M
}

#####
library(splitstackshape)
f.cSplit <- function(mydf) cSplit(as.data.table(mydf, keep.rownames = TRUE), names(mydf), "_")

#####
library(stringi)
f.stringi <- function(mydf) `dimnames<-`(do.call(cbind, 
                     lapply(mydf, stri_split_fixed, "_", simplify = TRUE)), 
             list(rownames(mydf), rep(colnames(mydf), each = 2)))

#####
library(dplyr)
library(tidyr)

f.dplyr <- function(df) lapply(names(df), function(x) separate_(df[x], x, paste0(x,"_",1:2), sep = "_" )) %>% 
  bind_cols

#####
library(iotools)
f.mstrsplit <- function(mydf) `dimnames<-`(do.call(cbind, 
                     lapply(mydf, mstrsplit, "_", ncol = 2, type = "character")),
             list(rownames(mydf), rep(colnames(mydf), each = 2)))



#####
library(rbenchmark)

benchmark(f.mtrx(m), f.mtrx2(m), f.dplyr(d), f.stringi(d), f.mstrsplit(d), replications = 10)

Results:

            test replications elapsed relative user.self sys.self user.child sys.child
3     f.dplyr(d)           10  27.722   10.162    27.360    0.269          0         0
5 f.mstrsplit(d)           10   2.728    1.000     2.607    0.098          0         0
1      f.mtrx(m)           10  37.943   13.909    34.885    0.799          0         0
2     f.mtrx2(m)           10  15.176    5.563    13.936    0.802          0         0
4   f.stringi(d)           10   8.107    2.972     7.815    0.247          0         0

In the updated benchmark, the winner is f.mstrsplit.

It seems to me that f.stringi is the fastest, so it is not clear how the data.table method wins here.
If you really want a winner, check out the update at the bottom of my answer.
@AnandaMahto, OP mentioned matrix of 1000x2830 size, so I guess performance is crucial here. Your latest solution, f.mstrsplit(), is a winner on large matrices.
