Saw an answer in Python similar for this question but not R so for the sake of being redundant since python is not in my wheel house. Data variable "PublicFilings" contains multiple values that I would like to split into 4 new variables. There are three basic outputs listed below but there will be different combinations of counts for Judgments, Liens and Suits, needless to say bankruptcy is yes no but I'd like that binary. Any thoughts on a simply approach for a data frame? Id can be used as a primary key the combination of No data as an initial output, inability to use comma to separate and desire to convert yes no to binary is throwing me off.
Existing Data
Id PublicFilings
1 Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0
2 Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0
3 No Data
No Data implies there was no match to an entity and no Public Filing data was returned
Converted Data
Id Bankruptcies Judgments Liens Suits
1 0 0 0 0
2 1 0 0 0
3 Null Null Null Null
df1 <-
structure(list(TranId = 1:3,
Name = c("ACME Five,","ACME","WALMART"),
Check = c("1234","1234","1235"),
Entity = c("55555","55551","55556"),
Match =c("0","0","0"),
Score = c("50","60","NA"),
Date = c("2019-01-01", "2019-01-02","2019-01-02"),
PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 10, Suits: 0",
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0",
"No Data"),
Controls =c("2015","2015","1998"),
NumEmpoyees = c("5","8","6"),
LOB = c("Retail, Food","Retail, Food","Retail, All"),
PayScore = c("40","42","NA"),
Primary = c("CEO","CEO","CFO"),
STARTYear = c("1982","1982","1965"),
SpecEvent = c("0","0","0"),
Filings =c("0","0","1"),
PayExp =c("","","1"
)), class = "data.frame", row.names = c(NA, -3L))
View(df1)
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(PublicFilings, sep = ",\\s+") %>%
separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
mutate(key = na_if(key, "No Data"),
value = as.integer(value %in% c("Yes", "1"))) %>%
pivot_wider(names_from = key, values_from = value) %>%
select(-`NA`)
View(df1)
# A tibble: 3 x 20
TranId Name Check Entity Match Score Date Controls NumEmpoyees LOB PayScore Primary STARTYear SpecEvent Filings
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 ACME~ 1234 55555 0 50 2019~ 2015 5 Reta~ 40 CEO 1982 0 0
2 2 ACME 1234 55551 0 60 2019~ 2015 8 Reta~ 42 CEO 1982 0 0
3 3 WALM~ 1235 55556 0 NA 2019~ 1998 6 Reta~ NA CFO 1965 0 1
# ... with 5 more variables: PayExp <chr>, Bankruptcies <int>, Judgments <int>, Liens <int>, Suits <int>
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [9].
No Datadf1 %>% mutate(PublicFilings = replace(PublicFilings, PublicFilings == 'No Data', 'NoData: No Data')) %>% separate_rows(PublicFilings, sep = ",\\s+") %>% separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>% mutate(key = na_if(key, "No Data"), value = as.integer(value %in% c("Yes", "1"))) %>% pivot_wider(names_from = key, values_from = value)or make some adjustments in the regex pattern