
I have a very large data.table and want to summarise, by group, the columns whose names start with a certain pattern.

The columns I am interested in always have the same format, namely: f<X>_<Y>, m<X>_<Y>, f<X>, m<X>.

This is the list of all possible column names:

ageColsPossible <- c("m0_9", "m10_19", "m20_29", "m30_39", "m40_49", "m50_59", "m60_69",
                   "f0_9", "f10_19", "f20_29", "f30_39", "f40_49", "f50_59", "f60_69") 

If there is not enough data available, my data.table will only have some of these columns. I would like to get a vector of the column names that are actually present in the data:

>   names(myData)
 [1] "clientID"             "policyID"             "startYear"            "product"              "NOplans"              "grp"                 
 [7] "policyid"             "personid"             "age"                  "gender"               "dependant"            "location"            
[13] "region"               "exposure"             "startMonth"           "cover_effective_date" "endexposuredate"      "fromdate"            
[19] "enddate"              "planHistSufficiency"  "productRank"          "claim10month"         "claim11month"         "claim12month"        
[25] "claim9month"          "NA20_29"              "NA30_39"              "NA40_49"              "NA50_59"              "f0_9"                
[31] "f10_19"               "f20_29"               "f30_39"               "f40_49"               "f50_59"               "f60_69"              
[37] "m0_9"                 "m10_19"               "m20_29"               "m30_39"               "m40_49"               "m50_59"              
[43] "m60_69"               "u0_9"                 "u10_19"               "u20_29"               "u30_39"               "u40_49"              
[49] "u50_59"               "u60_69"               "uNA" 

I know of regex and was thinking of something along the lines of regex = "(m|f)(\\d+)_?(\\d+)?", but I have also seen a patterns() function somewhere. Unfortunately, I can no longer find it.

Any ideas?

  • .SDcols accepts patterns(), so you can select columns for .SD using a regex. Commented May 27, 2020 at 10:15
  • grep("^[mf]\\d+(?:_\\d+)?$", names(myData), value=TRUE)? Commented May 27, 2020 at 10:15
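To make the two suggestions for getting the vector of available columns concrete, here is a minimal base R sketch. The vector `cols` is a made-up subset of the names(myData) output from the question, and the names `ageColsRegex` and `ageColsKnown` are purely illustrative. It compares an anchored regex with simply intersecting the known list against the names that are present.

```r
# Hypothetical subset of the column names shown in the question
cols <- c("clientID", "age", "NA20_29", "f0_9", "f10_19",
          "m0_9", "m60_69", "u0_9", "uNA")

ageColsPossible <- c("m0_9", "m10_19", "m20_29", "m30_39", "m40_49", "m50_59", "m60_69",
                     "f0_9", "f10_19", "f20_29", "f30_39", "f40_49", "f50_59", "f60_69")

# Option 1: anchored regex; "m" or "f", digits, then an optional "_<digits>" part
ageColsRegex <- grep("^[mf]\\d+(_\\d+)?$", cols, value = TRUE)

# Option 2: keep only the known names that actually occur in the data
ageColsKnown <- intersect(ageColsPossible, cols)

print(ageColsRegex)
print(ageColsKnown)
```

Both options select the same four columns here; they differ only in ordering, since grep() follows the order of the data and intersect() follows the order of ageColsPossible.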

1 Answer


Something like this will most likely do the trick, assuming you only need one summary function (median() in this example):

DT[, lapply( .SD, median), by=.(group), .SDcols = patterns( "^[mf]\\d+" ) ]
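For a quick check of this approach, here is a small self-contained sketch on toy data. The table, the `group` column and all values are made up for illustration, and the regex is a slightly stricter, fully anchored variant of the one above (an assumption on my part, so that only names of the form f<X>, m<X>, f<X>_<Y> or m<X>_<Y> match).

```r
library(data.table)

# Toy data; `group` and all values are hypothetical
DT <- data.table(
  group = c("a", "a", "b", "b"),
  age   = c(30, 40, 50, 60),
  m0_9  = c(1, 3, 5, 7),
  f0_9  = c(2, 4, 6, 8),
  u0_9  = c(9, 9, 9, 9)
)

# Median by group of every column whose name matches the m/f age-band pattern
res <- DT[, lapply(.SD, median), by = .(group),
          .SDcols = patterns("^[mf]\\d+(_\\d+)?$")]

print(res)
```

Only m0_9 and f0_9 are summarised; age and u0_9 are left out because their names do not start with "m" or "f" followed by a digit.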