1

I want to split a column in my dataset into multiple columns based on assigned separated sign: "|".

My dataset looks like this:

vname<-c("x1", "x2", "x3","x4")
label<-c("1,Eng |2,Man", "1,yes|2,no|3,dont know", "1,never|2,sometimes|3,usually|4,always", "1,yes|2,No|3,dont know")
df<-data.frame(vname, label)

So, I want to split column: label in to multiple columns based on sign" "|". I used stringr::str_split to do that, and my code is like:

cd2<-df %>%
  select(vname, everything())%>%
  mutate(label=str_split(value, " \\| "))

However, the result returns a vector in column of label. It looks like this:

vname  label
x1     c("1,Eng","2,Man")
x2     c("1,yes","2,no", "3,dont know")
....

My question is that how to get a expected result like this:

vname  label1   label2      label3          label4
x1     1,Eng    2,Man
x2     1,yes    2,no         3, dont know
x3     1,never. 2,sometimes, 3,usually.     4,always
...

Thanks a lot for help~~~


dput(head(cd2, 10)) 
structure(list(variable = c("x2", "x8", "x9", "x10", "x13", "x14", 
"x15", "x20", "x22", NA), vname = c("consenting_language", "county", 
"respondent", "residence", "language", "int_q1", "int_q2", "int_q4", 
"int_q5", "int_q6"), label = c("Consenting Language", "County", 
"Respondent Type", "Residence", "Interview language ", "1. What was your sex at birth?", 
"2. How would you describe your current sexual orientation?", 
"4. What is the highest level of education you completed?", "5. What is your current marital status?", 
"<div class=\"rich-text-field-label\"><p>6. Is <span style=\"color: #3598db;\">regular </span>your partner currently living with you now, or does s/he stay elsewhere?</p></div>"
), value = c("1, English | 2, Kiswahili", "1, County011 | 2, County014  | 3, County002| 4, County006  | 5, County010 | 6, County008  | 7, County005  | 8, County003 | 9, County012| 10, County004 | 11, County009  | 12, County001 | 13, County015 | 14, County007 | 15, County012", 
"1, FSW | 2, MSM | 3, AGYW", "1, Urban | 2, Peri urban | 3, Rural", 
"1, English | 2, Kiswahili", "1, Male | 2, Female", "1, Homosexual/Gay | 2, Bisexual | 3, Heterosexual/Straight | 4, Transgender Male | 5, Transgender Female | 96, Other | 98, Don't Know | 99, Decline to state", 
"1,None  | 2,Nursery/kindergarten | 3,Primary | 4,Secondary | 5,Tertiary/Vocational | 6,College/University | 7,Adult education | 96,Other", 
"1, Single/Not married | 2, Married | 3, Cohabiting | 4, Divorced | 5, Separated | 6, Widowed | 7, In a relationship", 
"1, Living with You | 2, Staying Elsewhere")), row.names = c(NA, 
10L), class = "data.frame")

2 Answers 2

6

With the code used, it is returning a list (perhaps we have to make sure that there are zero or more spaces as in the example there was no space), we can unnest_wider to convert to new columns

library(dplyr)
library(stringr)
library(tidyr)
df %>%
  select(vname, everything())%>%
  mutate(label=str_split(label, "\\s*\\|\\s*")) %>% 
  unnest_wider(where(is.list), names_sep = "")

-output

# A tibble: 4 × 5
  vname label1  label2      label3      label4  
  <chr> <chr>   <chr>       <chr>       <chr>   
1 x1    1,Eng   2,Man       <NA>        <NA>    
2 x2    1,yes   2,no        3,dont know <NA>    
3 x3    1,never 2,sometimes 3,usually   4,always
4 x4    1,yes   2,No        3,dont know <NA>    

This maybe also done with separate

library(tidyr)
 df %>%
   separate(label, into = str_c('label', 
   seq_len(max(str_count(.$label, fixed("|"))) + 1)), 
      sep = "\\|", fill = "right")

-output

 vname  label1      label2      label3   label4
1    x1  1,Eng        2,Man        <NA>     <NA>
2    x2   1,yes        2,no 3,dont know     <NA>
3    x3 1,never 2,sometimes   3,usually 4,always
4    x4   1,yes        2,No 3,dont know     <NA>

Or using the OP's data 'cd2' - added the case for spaces before and after the |

cd2new <- cd2 %>% 
   separate(value, into = str_c('value', 
   seq_len(max(str_count(.$value, fixed("|"))) + 1)), 
      sep = "\\s*\\|\\s*", fill = "right")

-output

> head(cd2new, 2)
  variable               vname               label       value1       value2       value3       value4       value5
1       x2 consenting_language Consenting Language   1, English 2, Kiswahili         <NA>         <NA>         <NA>
2       x8              county              County 1, County011 2, County014 3, County002 4, County006 5, County010
        value6       value7       value8       value9       value10       value11       value12       value13
1         <NA>         <NA>         <NA>         <NA>          <NA>          <NA>          <NA>          <NA>
2 6, County008 7, County005 8, County003 9, County012 10, County004 11, County009 12, County001 13, County015
        value14       value15
1          <NA>          <NA>
2 14, County007 15, County012
Sign up to request clarification or add additional context in comments.

16 Comments

Thanks a lot @akrun, may I ask why my R tells me that could not find function"where" after I installed all 3 of them? Is the "where" masked by other package? Is it from dplyr package? Thanks a lot~~~!
where is from tidyselect which the dplyr uses. Some packages are automaticlaly installed as dependency when you install dplyr. I am guessing that you may have an older version of packages i.e. I used packageVersion('tidyselect')# [1] ‘1.1.2’ > packageVersion('dplyr')# [1] ‘1.0.9’
Thanks @akrun, I am still figuring how to solve the version issue. For the second method, the R keeps warning me that "Error in stri_count_fixed(string, pattern, opts_fixed = opts(pattern)) : object 'value' not found". I copied your template and only changed the variable name from label to value. Could you help to see where is wrong? Thanks~~~
@Rstudyer value not found is not clear as there is no column name in the example that is value. I used packageVersion('stringr')# [1] ‘1.4.0’
It works perfectly~~~!!! Thanks so much for your help~~!@akrun~~
|
3

you can do this simply by using separate() from {tidyr}

library(tidyverse)

dat %>% as_tibble() %>% 
  separate(value, sep = "\\s*\\|\\s*", 
           into = paste0("value", seq(str_count(.$value, "\\s*\\|\\s*"))))


1 Comment

Thanks @shafee, it is helpful~~! The challenge is the label in my real dataset can be split from 1 to n, so it is hard to set "1:4".

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.