Split a column into multiple columns based on string pattern (before delimiter)

Question

I would like to do the following in R (but am open to suggestions in bash): I have a long list of elements (20,000) that are part of 80 groups. Each group starts with the same string before the underscore delimiter. I want to split the column of all elements into a new data frame containing 80 columns, according to the pattern before the underscore. The columns will have different sizes, so NA values are acceptable.

E.g. the column I want to split:

head(df$V1)

FOO1_Yu
FOO1_uN
FOO2_Yo
FOO2_yA
FOO10_nO
FOO10_Yes
FOO1_NoY

Desired outcome (a new df, with headers included in the first row):

head(df2)
FOO1    FOO2    FOO10
FOO1_Yu FOO2_Yo FOO10_nO
FOO1_uN FOO2_yA FOO10_Yes
FOO1_NoY

Any ideas? (And thanks in advance!)

Soren · Accepted Answer · 2019-03-04 17:51:23Z

The following uses the reshape2 package to get the result you're looking for. Note that because dcast casts the data into a wide-format data.frame, missing values are filled with NA. (Your question shows blank spaces where columns have two vs. three elements, but a true blank isn't possible in a data.frame: every row needs to be filled with something, in this case NA.) The approach is as follows: (1) use strsplit to split each element on "_" and collect the resulting name/value pairs in a data frame; (2) use dcast to spread the values into one column per name.

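library(reshape2)

df <- data.frame(V1=c("FOO1_Yu","FOO1_uN","FOO2_Yo","FOO2_yA","FOO10_nO","FOO10_Yes","FOO1_NoY"), stringsAsFactors = FALSE)
head(df$V1)

# Build one (name, value) row per element; elements without "_" are skipped
splits <- lapply(df$V1, function(x) {
  if (!grepl("_", x)) {
    print(paste("Skipping bad input=", x))
    return(NULL)
  } else {
    pair <- unlist(strsplit(x, split = "_"))
    name <- pair[1]   # group prefix before the first underscore
    value <- x        # keep the full element as the cell value
    return(data.frame(name = name, value = value))
  }
})

# Stack the per-element rows into one long data frame
splits <- do.call("rbind", splits)

# Cast to wide format: one row per value, one column per name
df <- dcast(splits, value ~ name, value.var = "value")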

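The output is as follows (row and column order may differ depending on your R version, since R 4.0 no longer converts strings to factors by default):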

      value     FOO1    FOO2     FOO10
1   FOO1_Yu  FOO1_Yu    <NA>      <NA>
2   FOO1_uN  FOO1_uN    <NA>      <NA>
3   FOO2_Yo     <NA> FOO2_Yo      <NA>
4   FOO2_yA     <NA> FOO2_yA      <NA>
5  FOO10_nO     <NA>    <NA>  FOO10_nO
6 FOO10_Yes     <NA>    <NA> FOO10_Yes
7  FOO1_NoY FOO1_NoY    <NA>      <NA>
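
If you would rather have the compact layout shown in the question (one column per prefix, with shorter columns padded by NA at the bottom), a base R sketch along the following lines may help. It is not part of the reshape2 approach above; it assumes every element contains an underscore, and the names v, prefix, groups, n_max and df2 are only illustrative.

```r
# Illustrative input, same values as in the question
v <- c("FOO1_Yu", "FOO1_uN", "FOO2_Yo", "FOO2_yA",
       "FOO10_nO", "FOO10_Yes", "FOO1_NoY")

# Prefix before the first underscore (assumes every element has one)
prefix <- sub("_.*$", "", v)

# One character vector per prefix, in order of first appearance
groups <- split(v, factor(prefix, levels = unique(prefix)))

# Pad each group with NA up to the length of the longest group
n_max <- max(lengths(groups))
padded <- lapply(groups, function(g) c(g, rep(NA_character_, n_max - length(g))))

# Combine the padded vectors into a data frame, one column per prefix
df2 <- as.data.frame(padded, stringsAsFactors = FALSE)
print(df2)
```

Using a factor with explicit levels keeps the columns in the order the prefixes first appear, rather than sorting them alphabetically.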

3 Comments

Rodrigo Duarte Over a year ago

Hey Soren, this is perfect! Thanks a lot!

Soren Over a year ago

With pleasure. You could also replace "value <- x" above with "value <- pair[2]" to get just the value element, although the question asked for the full element, as in the answer.

Rodrigo Duarte Over a year ago

Oh, I tried playing with this now because I also needed a table with 1/0s instead of element names and NAs. But I couldn't really figure out how to adapt your code, so I was just cheeky and used df[!is.na(df)]<-1 and df[is.na(df)]<-0. But your R code really bugged my head - I must improve my R game haha. Thanks a lot!
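
For the 1/0 table mentioned in the comment above, one option that avoids overwriting the NAs afterwards is to cross-tabulate elements against their prefixes with base R's table(). This is only a sketch, assuming every element contains an underscore and that elements are unique; the names v, prefix, tab and indicator are illustrative.

```r
# Illustrative input, same values as in the question
v <- c("FOO1_Yu", "FOO1_uN", "FOO2_Yo", "FOO2_yA",
       "FOO10_nO", "FOO10_Yes", "FOO1_NoY")

# Prefix before the first underscore (assumes every element has one)
prefix <- sub("_.*$", "", v)

# Count how often each element occurs with each prefix;
# with unique elements every cell is 0 or 1
tab <- table(element = v, prefix = prefix)

# Convert the contingency table to a plain data frame (elements as row names)
indicator <- as.data.frame.matrix(tab)
print(indicator)
```

Because the elements end up as row names rather than in a column, there is no "value" column to be overwritten by the 1/0 replacement.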
