3

Saw an answer in Python similar for this question but not R so for the sake of being redundant since python is not in my wheel house. Data variable "PublicFilings" contains multiple values that I would like to split into 4 new variables. There are three basic outputs listed below but there will be different combinations of counts for Judgments, Liens and Suits, needless to say bankruptcy is yes no but I'd like that binary. Any thoughts on a simply approach for a data frame? Id can be used as a primary key the combination of No data as an initial output, inability to use comma to separate and desire to convert yes no to binary is throwing me off.

Existing Data 
Id   PublicFilings 
1    Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0 
2    Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0 
3    No Data

No Data implies there was no match to an entity and no Public Filing data was returned

Converted Data 
Id Bankruptcies Judgments Liens Suits 
1  0             0         0     0 
2  1             0         0     0 
3 Null           Null      Null  Null



   df1 <- 
  structure(list(TranId = 1:3, 
                 Name = c("ACME Five,","ACME","WALMART"),
                 Check = c("1234","1234","1235"), 
                 Entity = c("55555","55551","55556"),
                 Match =c("0","0","0"),
                 Score = c("50","60","NA"),
                 Date = c("2019-01-01", "2019-01-02","2019-01-02"),
                 PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 10, Suits: 0", 
                                   "Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", 
                                   "No Data"),
                 Controls =c("2015","2015","1998"),
                 NumEmpoyees = c("5","8","6"),
                 LOB = c("Retail, Food","Retail, Food","Retail, All"),
                 PayScore = c("40","42","NA"),
                 Primary = c("CEO","CEO","CFO"),
                 STARTYear = c("1982","1982","1965"),
                 SpecEvent = c("0","0","0"),
                 Filings =c("0","0","1"),
                 PayExp =c("","","1"
                 )), class = "data.frame", row.names = c(NA, -3L))

View(df1)


library(dplyr)
library(tidyr)
df1 %>%
  separate_rows(PublicFilings, sep = ",\\s+") %>%
  separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
  mutate(key = na_if(key, "No Data"),
         value = as.integer(value %in%  c("Yes", "1"))) %>%
  pivot_wider(names_from = key, values_from = value) %>%
  select(-`NA`)
View(df1)

    # A tibble: 3 x 20
  TranId Name  Check Entity Match Score Date  Controls NumEmpoyees LOB   PayScore Primary STARTYear SpecEvent Filings
   <int> <chr> <chr> <chr>  <chr> <chr> <chr> <chr>    <chr>       <chr> <chr>    <chr>   <chr>     <chr>     <chr>  
1      1 ACME~ 1234  55555  0     50    2019~ 2015     5           Reta~ 40       CEO     1982      0         0      
2      2 ACME  1234  55551  0     60    2019~ 2015     8           Reta~ 42       CEO     1982      0         0      
3      3 WALM~ 1235  55556  0     NA    2019~ 1998     6           Reta~ NA       CFO     1965      0         1      
# ... with 5 more variables: PayExp <chr>, Bankruptcies <int>, Judgments <int>, Liens <int>, Suits <int>
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [9].
7
  • the solution needs to return all variables but split out PublicFilings into (4) new variables, missing values can exist in any variable except Id, Check, Entity, these are the only not null fields in the database Commented Jan 16, 2020 at 19:45
  • It is a warning message. Nothing to worry about. I already knew that issue. It is happening at the row with No Data Commented Jan 16, 2020 at 19:49
  • If you want to remove that warnings either can do a small change df1 %>% mutate(PublicFilings = replace(PublicFilings, PublicFilings == 'No Data', 'NoData: No Data')) %>% separate_rows(PublicFilings, sep = ",\\s+") %>% separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>% mutate(key = na_if(key, "No Data"), value = as.integer(value %in% c("Yes", "1"))) %>% pivot_wider(names_from = key, values_from = value) or make some adjustments in the regex pattern Commented Jan 16, 2020 at 19:53
  • @akrun the split and pivot is not completing, no new fields are being created Commented Jan 16, 2020 at 19:57
  • 1
    @akrun had to much fun troubleshooting this, had three issues in the PublicFilings field data quality that I did not previously identify and were not part of the sample data effecting like 30 of 13K total observations,in any event got to run finally TY Commented Jan 17, 2020 at 18:07

3 Answers 3

1

One option is to split the 'PublicFilings' at the , into 'long' format, then create two columns with separate, and reshape into 'wide' format with pivot_wider

library(dplyr)
library(tidyr)
df1 %>%
     separate_rows(PublicFilings, sep = ",\\s+") %>%
     separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
     mutate(key = na_if(key, "No Data"),
           value = as.integer(value %in%  c("Yes", "1"))) %>%
     pivot_wider(names_from = key, values_from = value) %>%
     select(-`NA`)
#    Id Bankruptcies Judgments Liens Suits
#1  1            0         0     0     0
#2  2            1         0     0     0
#3  3           NA        NA    NA    NA

data

df1 <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0", 
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))
Sign up to request clarification or add additional context in comments.

9 Comments

hole new can of worms as pivot_wider function in a an updated version of R and I am unable to update my version, went down the road of trying to install install.packages("devtools") devtools::install_github("hadley/tidyverse") based on different post now not recognizing dpylr package
OK just got tidyr and dplyr reinstalled and working, receiving this message akrun Error in spread(key, value) : object 'key' not found
tried a few different adjustments to spread substitution without any luck
@Ricky Can I know the R version you have. I am using R 3.6.2
3.4.3 version control at a corporate level
|
0

Here is a base R solution, where strsplit() + gsub() are used

u <- strsplit(df$PublicFilings,split = ", ")
u[[which(lengths(u)==1)]] <- rep(NA, max(lengths(u)))
M <- do.call(rbind, Map(function(x) gsub(".*:\\s","",x), u))
M[M=="No"] <- 0
M[M=="Yes"] <- 1
dfout <- cbind(df[1],setNames(data.frame(M),gsub(":.*","",u[[which.max(lengths(u))]])))

such that

> dfout
  Id Bankruptcies Judgments Liens Suits
1  1            0         0     0     0
2  2            1         0     0     0
3  3         <NA>      <NA>  <NA>  <NA>

DATA

df <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0", 
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))

1 Comment

this solution is returning no data in the 4 new data variables vs NA when applied to the broader data set
0

These answers were helpful in working out my problem- I'm saving my slightly different version which essentially does the same thing, captured into a function which could be polished further.

├ Function to extract ----

extract_keyval <- function(string, sep=","){
  string = gsub("No Data", "", string, ignore.case = T) # handle missing data
  key_value =  unlist(strsplit(string, split = sep)) # vector of key-value pairs  
  vars = gsub("(.*):\\s+.*", "\\1", key_value)    # extract variable name
  values = gsub(".*:\\s+(.*)", "\\1", key_value) # extract values
  values = gsub("yes", "1", values, ignore.case = T)
  values = gsub("no", "0", values, ignore.case = T)
  names(values) = vars # attach names to value
  return(values) # named vector 
  }

I typically use the purrr package to do the transformation back to a dataframe:

├ Tidyverse option ----

df1 %>% mutate(map_df(PublicFilings, ~extract_keyval(.x)))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.