I would like to do the following in R (but am open to suggestions in bash): I have a long list of elements (20,000) that are part of 80 groups. Each group starts with the same string before the underscore delimiter. I want to split the column of all elements into a new data frame containing 80 columns, according to the pattern before the underscore. The columns will have different sizes, so NA values are acceptable.

E.g. the column I want to split:

head(df$V1)

FOO1_Yu
FOO1_uN
FOO2_Yo
FOO2_yA
FOO10_nO
FOO10_Yes
FOO1_NoY

Desired outcome (a new df, with headers included in the first row):

head(df2)
FOO1    FOO2    FOO10
FOO1_Yu FOO2_Yo FOO10_nO
FOO1_uN FOO2_yA FOO10_Yes
FOO1_NoY        

Any ideas? (And thanks in advance!)

1 Answer

The following uses the reshape2 package to get the results you're looking for. Note that because the data are cast into a wide-format data.frame, missing values are replaced with NA (your question shows blank cells where columns have two vs. three elements, but a true blank isn't possible in a data.frame: every cell has to hold something, in this case NA).

The approach is as follows: (1) use strsplit to split each element on "_" and collect the group name and the full element in a data frame; (2) use dcast to spread the elements into one column per group name.

library(reshape2)

# Sample data from the question; df must exist before it is inspected
df <- data.frame(V1=c("FOO1_Yu","FOO1_uN","FOO2_Yo","FOO2_yA","FOO10_nO","FOO10_Yes","FOO1_NoY"),stringsAsFactors = FALSE)
head(df$V1)

splits <- lapply(df$V1,function(x)
  {
    if (!grepl("_",x)) 
    {
      print(paste("Skipping bad input=",x)) 
      return (NULL)
    } else { 
      pair <- unlist(strsplit(x,split="_"))
      name <- pair[1]
      value <- x
      return (data.frame(name=name,value=value)) 
    }
  })

splits <- do.call("rbind",splits)

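# One row per element, one column per group; naming value.var avoids dcast's "Using value as value column" message
df <- dcast(splits,value ~ name,value.var="value")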

The output results as follows:

      value     FOO1    FOO2     FOO10
1   FOO1_Yu  FOO1_Yu    <NA>      <NA>
2   FOO1_uN  FOO1_uN    <NA>      <NA>
3   FOO2_Yo     <NA> FOO2_Yo      <NA>
4   FOO2_yA     <NA> FOO2_yA      <NA>
5  FOO10_nO     <NA>    <NA>  FOO10_nO
6 FOO10_Yes     <NA>    <NA> FOO10_Yes
7  FOO1_NoY FOO1_NoY    <NA>      <NA>
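
The table above keeps one row per element, whereas the question's desired outcome stacks each group's elements at the top of its column. A minimal base R sketch of that compact layout, assuming the group name is everything before the first underscore and that groups should appear in order of first appearance (the names prefix, groups, padded and df2 are only illustrative):

```r
# Sample input from the question
df <- data.frame(
  V1 = c("FOO1_Yu", "FOO1_uN", "FOO2_Yo", "FOO2_yA",
         "FOO10_nO", "FOO10_Yes", "FOO1_NoY"),
  stringsAsFactors = FALSE
)

# Group name = everything before the first underscore (assumption)
prefix <- sub("_.*$", "", df$V1)

# One character vector per group, groups kept in order of first appearance
groups <- split(df$V1, factor(prefix, levels = unique(prefix)))

# Pad every group with NA up to the length of the longest group
n <- max(lengths(groups))
padded <- lapply(groups, function(g) c(g, rep(NA_character_, n - length(g))))

# check.names = FALSE keeps the group names unchanged as column headers
df2 <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
df2
```

With the sample data, df2 has three rows: FOO1 holds all three of its elements, and FOO2 and FOO10 are padded with one NA each. The same steps should carry over to the full list of 20,000 elements, since nothing here depends on the number of groups.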

3 Comments

Hey Soren, this is perfect! Thanks a lot!
With pleasure. You could also replace "value <- x" above with "value <- pair[2]" to keep only the part after the underscore, although the question asks for the full element, as used in the answer.
Oh, I tried playing with this now because I also needed a table with 1/0s instead of element names and NAs. But then I couldn't really figure out how to adapt your code so I was just cheeky and used df[!is.na(df)]<-1 and df[is.na(df)]<-0.. But your R code really bugged my head - I must improve my R game haha. Thanks a lot!
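
For the 1/0 table mentioned in the last comment, one possible alternative to overwriting the NA values is to cross-tabulate elements against their group names with base R's table. This is only a sketch under the same assumption that the group name is the text before the first underscore; V1, prefix and indicator are illustrative names:

```r
# Sample input from the question
V1 <- c("FOO1_Yu", "FOO1_uN", "FOO2_Yo", "FOO2_yA",
        "FOO10_nO", "FOO10_Yes", "FOO1_NoY")

# Group name = everything before the first underscore (assumption)
prefix <- sub("_.*$", "", V1)

# Count element/group co-occurrences; factor levels keep the original order
counts <- table(
  element = factor(V1, levels = unique(V1)),
  group   = factor(prefix, levels = unique(prefix))
)

# Convert the contingency table into a plain data frame of 0/1 counts
indicator <- as.data.frame.matrix(counts)
indicator
```

Each row is an element and each column a group, with 1 where the element belongs to the group and 0 elsewhere. If an element occurred more than once, its count would exceed 1, so this assumes the elements are unique.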
