1

I have a really large data set that I need to create new columns for based on text strings from other columns, and put them in a binary format. I have a person ID column and a set of 99 "Diagnosis Codes" that contain a text string that corresponds to a particular health condition.

Sample of Original Data

Person ID Diagnosis Code 1 Diagnosis Code 2 Diagnosis Code 3
10 N18.3 V34.2 E73
11 F35.9 X29 D4.0
12 G27.2 J05.1 J60

I need to get the data into this format....

PersonID N18.3 V34.2 E73 F35.9 G27.2 (plus all other codes)
10 1 1 1 0 0 etc
11 0 0 0 1 0 etc
12 0 0 0 0 1 etc

I have tried transposing, tabluation, so many other different ways and nothing seems to work. I'd appreciate any help!

2 Answers 2

1

A data.table option using melt and dcast.

library(data.table)

dcast(melt(setDT(df),id.vars = 'PersonID'),PersonID ~ value, fun.aggregate = length)

#   PersonID D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
#1:       10    0   1     0     0     0   0     1     1   0
#2:       11    1   0     1     0     0   0     0     0   1
#3:       12    0   0     0     1     1   1     0     0   0
Sign up to request clarification or add additional context in comments.

1 Comment

This works perfectly! Thank you so much, saved me many hours of grief!
1

We could reshape to 'long' format with pivot_longer and then reshape back to 'wide' with pivot_wider while specifying the values_fn as length and values_fill set to 0

library(dplyr)
library(tidyr)
df1 %>%
    pivot_longer(cols = -PersonID) %>%
    select(-name) %>%
    pivot_wider(names_from= value, values_from = value,
       values_fn= length, values_fill = 0)

-output

# A tibble: 3 x 10
  PersonID N18.3 V34.2   E73 F35.9   X29  D4.0 G27.2 J05.1   J60
     <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1       10     1     1     1     0     0     0     0     0     0
2       11     0     0     0     1     1     1     0     0     0
3       12     0     0     0     0     0     0     1     1     1

Or using recast from reshape2

library(reshape2)
recast(df1, id.var = 'PersonID', PersonID ~ value, length)
  PersonID D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
1       10    0   1     0     0     0   0     1     1   0
2       11    1   0     1     0     0   0     0     0   1
3       12    0   0     0     1     1   1     0     0   0

Or use base R with table after unlisting the columns other than the first one

table(rep(df1[[1]],  ncol(df1)-1),  unlist(df1[-1]))
      D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
  10    0   1     0     0     0   0     1     1   0
  11    1   0     1     0     0   0     0     0   1
  12    0   0     0     1     1   1     0     0   0

data

df1 <- structure(list(PersonID = 10:12, DiagnosisCode1 = c("N18.3", 
"F35.9", "G27.2"), DiagnosisCode2 = c("V34.2", "X29", "J05.1"
), DiagnosisCode3 = c("E73", "D4.0", "J60")), 
class = "data.frame", row.names = c(NA, 
-3L))

1 Comment

These are all very useful, I especially like the reshape function, I'll be using that very much in the future!!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.