How to parse or split string variable into multiple new variables in R

Question

Saw an answer in Python similar for this question but not R so for the sake of being redundant since python is not in my wheel house. Data variable "PublicFilings" contains multiple values that I would like to split into 4 new variables. There are three basic outputs listed below but there will be different combinations of counts for Judgments, Liens and Suits, needless to say bankruptcy is yes no but I'd like that binary. Any thoughts on a simply approach for a data frame? Id can be used as a primary key the combination of No data as an initial output, inability to use comma to separate and desire to convert yes no to binary is throwing me off.

Existing Data 
Id   PublicFilings 
1    Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0 
2    Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0 
3    No Data

No Data implies there was no match to an entity and no Public Filing data was returned

Converted Data 
Id Bankruptcies Judgments Liens Suits 
1  0             0         0     0 
2  1             0         0     0 
3 Null           Null      Null  Null



   df1 <- 
  structure(list(TranId = 1:3, 
                 Name = c("ACME Five,","ACME","WALMART"),
                 Check = c("1234","1234","1235"), 
                 Entity = c("55555","55551","55556"),
                 Match =c("0","0","0"),
                 Score = c("50","60","NA"),
                 Date = c("2019-01-01", "2019-01-02","2019-01-02"),
                 PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 10, Suits: 0", 
                                   "Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", 
                                   "No Data"),
                 Controls =c("2015","2015","1998"),
                 NumEmpoyees = c("5","8","6"),
                 LOB = c("Retail, Food","Retail, Food","Retail, All"),
                 PayScore = c("40","42","NA"),
                 Primary = c("CEO","CEO","CFO"),
                 STARTYear = c("1982","1982","1965"),
                 SpecEvent = c("0","0","0"),
                 Filings =c("0","0","1"),
                 PayExp =c("","","1"
                 )), class = "data.frame", row.names = c(NA, -3L))

View(df1)


library(dplyr)
library(tidyr)
df1 %>%
  separate_rows(PublicFilings, sep = ",\\s+") %>%
  separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
  mutate(key = na_if(key, "No Data"),
         value = as.integer(value %in%  c("Yes", "1"))) %>%
  pivot_wider(names_from = key, values_from = value) %>%
  select(-`NA`)
View(df1)

    # A tibble: 3 x 20
  TranId Name  Check Entity Match Score Date  Controls NumEmpoyees LOB   PayScore Primary STARTYear SpecEvent Filings
   <int> <chr> <chr> <chr>  <chr> <chr> <chr> <chr>    <chr>       <chr> <chr>    <chr>   <chr>     <chr>     <chr>  
1      1 ACME~ 1234  55555  0     50    2019~ 2015     5           Reta~ 40       CEO     1982      0         0      
2      2 ACME  1234  55551  0     60    2019~ 2015     8           Reta~ 42       CEO     1982      0         0      
3      3 WALM~ 1235  55556  0     NA    2019~ 1998     6           Reta~ NA       CFO     1965      0         1      
# ... with 5 more variables: PayExp <chr>, Bankruptcies <int>, Judgments <int>, Liens <int>, Suits <int>
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [9].

the solution needs to return all variables but split out PublicFilings into (4) new variables, missing values can exist in any variable except Id, Check, Entity, these are the only not null fields in the database — Ricky
– Ricky, Commented Jan 16, 2020 at 19:45
It is a warning message. Nothing to worry about. I already knew that issue. It is happening at the row with No Data — akrun
– akrun, Commented Jan 16, 2020 at 19:49
If you want to remove that warnings either can do a small change df1 %>% mutate(PublicFilings = replace(PublicFilings, PublicFilings == 'No Data', 'NoData: No Data')) %>% separate_rows(PublicFilings, sep = ",\\s+") %>% separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>% mutate(key = na_if(key, "No Data"), value = as.integer(value %in% c("Yes", "1"))) %>% pivot_wider(names_from = key, values_from = value) or make some adjustments in the regex pattern — akrun
– akrun, Commented Jan 16, 2020 at 19:53
@akrun the split and pivot is not completing, no new fields are being created — Ricky
– Ricky, Commented Jan 16, 2020 at 19:57
@akrun had to much fun troubleshooting this, had three issues in the PublicFilings field data quality that I did not previously identify and were not part of the sample data effecting like 30 of 13K total observations,in any event got to run finally TY — Ricky
– Ricky, Commented Jan 17, 2020 at 18:07

akrun · Accepted Answer · 2020-01-14 22:46:32Z

1

One option is to split the 'PublicFilings' at the , into 'long' format, then create two columns with separate, and reshape into 'wide' format with pivot_wider

library(dplyr)
library(tidyr)
df1 %>%
     separate_rows(PublicFilings, sep = ",\\s+") %>%
     separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
     mutate(key = na_if(key, "No Data"),
           value = as.integer(value %in%  c("Yes", "1"))) %>%
     pivot_wider(names_from = key, values_from = value) %>%
     select(-`NA`)
#    Id Bankruptcies Judgments Liens Suits
#1  1            0         0     0     0
#2  2            1         0     0     0
#3  3           NA        NA    NA    NA

data

df1 <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0", 
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))

edited Jan 14, 2020 at 22:46

answered Jan 14, 2020 at 20:26

akrun

891k38 gold badges590 silver badges700 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

Ricky Over a year ago

hole new can of worms as pivot_wider function in a an updated version of R and I am unable to update my version, went down the road of trying to install install.packages("devtools") devtools::install_github("hadley/tidyverse") based on different post now not recognizing dpylr package

Ricky Over a year ago

OK just got tidyr and dplyr reinstalled and working, receiving this message akrun Error in spread(key, value) : object 'key' not found

Ricky Over a year ago

tried a few different adjustments to spread substitution without any luck

akrun Over a year ago

@Ricky Can I know the R version you have. I am using R 3.6.2

Ricky Over a year ago

3.4.3 version control at a corporate level

|

ThomasIsCoding · Accepted Answer · 2020-01-14 21:07:33Z

0

Here is a base R solution, where strsplit() + gsub() are used

u <- strsplit(df$PublicFilings,split = ", ")
u[[which(lengths(u)==1)]] <- rep(NA, max(lengths(u)))
M <- do.call(rbind, Map(function(x) gsub(".*:\\s","",x), u))
M[M=="No"] <- 0
M[M=="Yes"] <- 1
dfout <- cbind(df[1],setNames(data.frame(M),gsub(":.*","",u[[which.max(lengths(u))]])))

such that

> dfout
  Id Bankruptcies Judgments Liens Suits
1  1            0         0     0     0
2  2            1         0     0     0
3  3         <NA>      <NA>  <NA>  <NA>

DATA

df <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0", 
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))

answered Jan 14, 2020 at 21:07

ThomasIsCoding

106k9 gold badges38 silver badges110 bronze badges

1 Comment

Ricky Over a year ago

this solution is returning no data in the 4 new data variables vs NA when applied to the broader data set

Matt L. · Accepted Answer · 2024-02-28 21:55:50Z

These answers were helpful in working out my problem- I'm saving my slightly different version which essentially does the same thing, captured into a function which could be polished further.

├ Function to extract ----

extract_keyval <- function(string, sep=","){
  string = gsub("No Data", "", string, ignore.case = T) # handle missing data
  key_value =  unlist(strsplit(string, split = sep)) # vector of key-value pairs  
  vars = gsub("(.*):\\s+.*", "\\1", key_value)    # extract variable name
  values = gsub(".*:\\s+(.*)", "\\1", key_value) # extract values
  values = gsub("yes", "1", values, ignore.case = T)
  values = gsub("no", "0", values, ignore.case = T)
  names(values) = vars # attach names to value
  return(values) # named vector 
  }

I typically use the purrr package to do the transformation back to a dataframe:

├ Tidyverse option ----

df1 %>% mutate(map_df(PublicFilings, ~extract_keyval(.x)))

Collectives™ on Stack Overflow

How to parse or split string variable into multiple new variables in R

3 Answers 3

data

9 Comments

1 Comment

├ Function to extract ----

├ Tidyverse option ----

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

data

9 Comments

1 Comment

├ Function to extract ----

├ Tidyverse option ----

Comments

Your Answer

Sign up or log in

Post as a guest

Related