Using R to split a string into multiple column instead of a vector in one column

Question

I want to split a column in my dataset into multiple columns based on assigned separated sign: "|".

My dataset looks like this:

vname<-c("x1", "x2", "x3","x4")
label<-c("1,Eng |2,Man", "1,yes|2,no|3,dont know", "1,never|2,sometimes|3,usually|4,always", "1,yes|2,No|3,dont know")
df<-data.frame(vname, label)

So, I want to split column: label in to multiple columns based on sign" "|". I used stringr::str_split to do that, and my code is like:

cd2<-df %>%
  select(vname, everything())%>%
  mutate(label=str_split(value, " \\| "))

However, the result returns a vector in column of label. It looks like this:

vname  label
x1     c("1,Eng","2,Man")
x2     c("1,yes","2,no", "3,dont know")
....

My question is that how to get a expected result like this:

vname  label1   label2      label3          label4
x1     1,Eng    2,Man
x2     1,yes    2,no         3, dont know
x3     1,never. 2,sometimes, 3,usually.     4,always
...

Thanks a lot for help~~~

dput(head(cd2, 10)) 
structure(list(variable = c("x2", "x8", "x9", "x10", "x13", "x14", 
"x15", "x20", "x22", NA), vname = c("consenting_language", "county", 
"respondent", "residence", "language", "int_q1", "int_q2", "int_q4", 
"int_q5", "int_q6"), label = c("Consenting Language", "County", 
"Respondent Type", "Residence", "Interview language ", "1. What was your sex at birth?", 
"2. How would you describe your current sexual orientation?", 
"4. What is the highest level of education you completed?", "5. What is your current marital status?", 
"<div class=\"rich-text-field-label\"><p>6. Is <span style=\"color: #3598db;\">regular </span>your partner currently living with you now, or does s/he stay elsewhere?</p></div>"
), value = c("1, English | 2, Kiswahili", "1, County011 | 2, County014  | 3, County002| 4, County006  | 5, County010 | 6, County008  | 7, County005  | 8, County003 | 9, County012| 10, County004 | 11, County009  | 12, County001 | 13, County015 | 14, County007 | 15, County012", 
"1, FSW | 2, MSM | 3, AGYW", "1, Urban | 2, Peri urban | 3, Rural", 
"1, English | 2, Kiswahili", "1, Male | 2, Female", "1, Homosexual/Gay | 2, Bisexual | 3, Heterosexual/Straight | 4, Transgender Male | 5, Transgender Female | 96, Other | 98, Don't Know | 99, Decline to state", 
"1,None  | 2,Nursery/kindergarten | 3,Primary | 4,Secondary | 5,Tertiary/Vocational | 6,College/University | 7,Adult education | 96,Other", 
"1, Single/Not married | 2, Married | 3, Cohabiting | 4, Divorced | 5, Separated | 6, Widowed | 7, In a relationship", 
"1, Living with You | 2, Staying Elsewhere")), row.names = c(NA, 
10L), class = "data.frame")

akrun · Accepted Answer · 2022-07-06 18:45:28Z

6

With the code used, it is returning a list (perhaps we have to make sure that there are zero or more spaces as in the example there was no space), we can unnest_wider to convert to new columns

library(dplyr)
library(stringr)
library(tidyr)
df %>%
  select(vname, everything())%>%
  mutate(label=str_split(label, "\\s*\\|\\s*")) %>% 
  unnest_wider(where(is.list), names_sep = "")

-output

# A tibble: 4 × 5
  vname label1  label2      label3      label4  
  <chr> <chr>   <chr>       <chr>       <chr>   
1 x1    1,Eng   2,Man       <NA>        <NA>    
2 x2    1,yes   2,no        3,dont know <NA>    
3 x3    1,never 2,sometimes 3,usually   4,always
4 x4    1,yes   2,No        3,dont know <NA>

This maybe also done with separate

library(tidyr)
 df %>%
   separate(label, into = str_c('label', 
   seq_len(max(str_count(.$label, fixed("|"))) + 1)), 
      sep = "\\|", fill = "right")

-output

 vname  label1      label2      label3   label4
1    x1  1,Eng        2,Man        <NA>     <NA>
2    x2   1,yes        2,no 3,dont know     <NA>
3    x3 1,never 2,sometimes   3,usually 4,always
4    x4   1,yes        2,No 3,dont know     <NA>

Or using the OP's data 'cd2' - added the case for spaces before and after the |

cd2new <- cd2 %>% 
   separate(value, into = str_c('value', 
   seq_len(max(str_count(.$value, fixed("|"))) + 1)), 
      sep = "\\s*\\|\\s*", fill = "right")

-output

> head(cd2new, 2)
  variable               vname               label       value1       value2       value3       value4       value5
1       x2 consenting_language Consenting Language   1, English 2, Kiswahili         <NA>         <NA>         <NA>
2       x8              county              County 1, County011 2, County014 3, County002 4, County006 5, County010
        value6       value7       value8       value9       value10       value11       value12       value13
1         <NA>         <NA>         <NA>         <NA>          <NA>          <NA>          <NA>          <NA>
2 6, County008 7, County005 8, County003 9, County012 10, County004 11, County009 12, County001 13, County015
        value14       value15
1          <NA>          <NA>
2 14, County007 15, County012

edited Jul 6, 2022 at 18:45

answered Jul 6, 2022 at 16:24

akrun

891k38 gold badges590 silver badges700 bronze badges

Sign up to request clarification or add additional context in comments.

16 Comments

Rstudyer Over a year ago

Thanks a lot @akrun, may I ask why my R tells me that could not find function"where" after I installed all 3 of them? Is the "where" masked by other package? Is it from dplyr package? Thanks a lot~~~!

akrun Over a year ago

where is from tidyselect which the dplyr uses. Some packages are automaticlaly installed as dependency when you install dplyr. I am guessing that you may have an older version of packages i.e. I used packageVersion('tidyselect')# [1] ‘1.1.2’ > packageVersion('dplyr')# [1] ‘1.0.9’

Rstudyer Over a year ago

Thanks @akrun, I am still figuring how to solve the version issue. For the second method, the R keeps warning me that "Error in stri_count_fixed(string, pattern, opts_fixed = opts(pattern)) : object 'value' not found". I copied your template and only changed the variable name from label to value. Could you help to see where is wrong? Thanks~~~

akrun Over a year ago

@Rstudyer value not found is not clear as there is no column name in the example that is value. I used packageVersion('stringr')# [1] ‘1.4.0’

Rstudyer Over a year ago

It works perfectly~~~!!! Thanks so much for your help~~!@akrun~~

|

Shafee · Accepted Answer · 2022-07-06 19:03:53Z

3

you can do this simply by using separate() from {tidyr}

library(tidyverse)

dat %>% as_tibble() %>% 
  separate(value, sep = "\\s*\\|\\s*", 
           into = paste0("value", seq(str_count(.$value, "\\s*\\|\\s*"))))

edited Jul 6, 2022 at 19:03

answered Jul 6, 2022 at 16:31

Shafee

20.9k4 gold badges39 silver badges73 bronze badges

1 Comment

Rstudyer Over a year ago

Thanks @shafee, it is helpful~~! The challenge is the label in my real dataset can be split from 1 to n, so it is hard to set "1:4".

Collectives™ on Stack Overflow

Using R to split a string into multiple column instead of a vector in one column

2 Answers 2

16 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

16 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related