Summarize columns where names have a specific pattern in data.table

Question

I have a very large data.table in which I want to summarise columns by group, where the column names start with a certain pattern.

The columns I am interested in always have the same format, namely: f<X>_<Y>, m<X>_<Y>, f<X>, m<X>.

This is the list of all possible column names:

ageColsPossible <- c("m0_9", "m10_19", "m20_29", "m30_39", "m40_49", "m50_59", "m60_69",
                   "f0_9", "f10_19", "f20_29", "f30_39", "f40_49", "f50_59", "f60_69")

If there is not enough data available, my data.table will only have some of these columns. I would like to get a vector with the column names that are available in the data:

>   names(myData)
 [1] "clientID"             "policyID"             "startYear"            "product"              "NOplans"              "grp"                 
 [7] "policyid"             "personid"             "age"                  "gender"               "dependant"            "location"            
[13] "region"               "exposure"             "startMonth"           "cover_effective_date" "endexposuredate"      "fromdate"            
[19] "enddate"              "planHistSufficiency"  "productRank"          "claim10month"         "claim11month"         "claim12month"        
[25] "claim9month"          "NA20_29"              "NA30_39"              "NA40_49"              "NA50_59"              "f0_9"                
[31] "f10_19"               "f20_29"               "f30_39"               "f40_49"               "f50_59"               "f60_69"              
[37] "m0_9"                 "m10_19"               "m20_29"               "m30_39"               "m40_49"               "m50_59"              
[43] "m60_69"               "u0_9"                 "u10_19"               "u20_29"               "u30_39"               "u40_49"              
[49] "u50_59"               "u60_69"               "uNA"

I know about regex and was thinking of something along the lines of regex = "(m|f)(\\d+)_?(\\d+)?", but I have also seen a patterns() function somewhere. Unfortunately I can no longer find it.

Any ideas?

.SDcols accepts patterns(), so you can select columns for .SD using a regex.
– Wimpel, Commented May 27, 2020 at 10:15
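
For the narrower goal of getting the available column names as a plain vector, base R may already be enough. The sketch below is only an illustration: `myData` is a made-up toy table, `ageColsPossible` is shortened from the question, and `ageCols` / `ageColsRegex` are hypothetical names.

```r
library(data.table)

# Toy stand-in for the real table (hypothetical data, not from the question)
myData <- data.table(clientID = 1:3, NA20_29 = 0, f0_9 = 1, m10_19 = 2, u0_9 = 3, uNA = 4)

# Shortened version of the list of possible names from the question
ageColsPossible <- c("m0_9", "m10_19", "f0_9", "f10_19")

# Option 1: keep only the known names that actually occur in the data
ageCols <- intersect(ageColsPossible, names(myData))

# Option 2: match by pattern, anchored at both ends so that
# names such as "u0_9" or "NA20_29" are not picked up
ageColsRegex <- grep("^[mf]\\d+(_\\d+)?$", names(myData), value = TRUE)
```

Either vector can then be passed to `.SDcols` directly. Option 1 is stricter, since it never matches a column outside the known list.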

Wimpel · Accepted Answer · 2020-05-27 10:22:03Z


Something like this will most likely do the trick, assuming you only need one summary function (median() in this example):

DT[, lapply( .SD, median), by=.(group), .SDcols = patterns( "^[mf]\\d+" ) ]
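
To try the call above on something small, here is a reproducible sketch. The data are made up, and `group` is an assumed grouping column (the question does not say which column to group by). It also assumes a data.table version recent enough to accept `patterns()` in `.SDcols`. The second call is one possible way to apply more than one summary function, not necessarily the best one.

```r
library(data.table)

# Hypothetical toy data; `group` is an assumed grouping column
DT <- data.table(
  group  = c("a", "a", "b", "b"),
  f0_9   = c(1, 3, 5, 7),
  m10_19 = c(2, 4, 6, 8),
  u0_9   = c(9, 9, 9, 9)   # does not start with m or f, so it is not selected
)

# Median of every column whose name starts with m or f followed by a digit
res <- DT[, lapply(.SD, median), by = .(group), .SDcols = patterns("^[mf]\\d+")]

# Several summaries at once: build a list per column, then flatten it
# so each column/function pair becomes its own output column
res2 <- DT[, unlist(lapply(.SD, function(x) list(mean = mean(x), sum = sum(x))),
                    recursive = FALSE),
           by = .(group), .SDcols = patterns("^[mf]\\d+")]
```

`res` has one row per group with the columns `group`, `f0_9` and `m10_19`; `res2` gets names of the form `f0_9.mean`, `f0_9.sum`, and so on.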
