Seems like an easy one, but... well...
Given a named vector of regular expressions and a data table as follows:
library(data.table)
regexes <- c(a="^A$")
dt <- fread("
a,A,1
a,B,1
b,A,1
")
The input data table is
dt
# V1 V2 V3
# 1: a A 1
# 2: a B 1
# 3: b A 1
My goal for the 1st element in regexes would be:
If V1=="a" set V3:=2. EXCEPT when V2 matches the corresponding regular expression ^A$, then V3:=3.
(a is names(regexes)[1], ^A$ is regexes[1], 2 and 3 are just for demo purposes. I also have more names and regular expressions to loop over, and the data set is about 300,000 rows.)
So the expected output is
# V1 V2 V3
# 1: a A 3 (*)
# 2: a B 2 (**)
# 3: b A 1
(*) 3 because V1 is a and V2 (A) matches the regex,
(**) 2 because V1 is a and V2 (B) does not match ^A$.
I tried to loop through the regexes and pipe the subsetting through like this:
for (x in seq(regexes))
dt[V1==names(regexes)[x], V3:=2][grepl(regexes[x], V2), V3:=3]
However...
dt
# V1 V2 V3
# 1: a A 3
# 2: a B 2
# 3: b A 3 <- wrong, should remain 1
... it does not work as expected: grepl uses the complete V2 column, not just the V1=="a" subset. I also tried some other things, which worked, but took too long (i.e. not the way to use data.table).
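To illustrate why the chained call sees every row, here is a minimal sketch on the toy data (built with data.table() instead of fread, and with integer values, both just for the demo). The name res is only an illustrative helper:

```r
library(data.table)

# Toy data from above, built directly instead of via fread (demo assumption)
dt <- data.table(V1 = c("a", "a", "b"),
                 V2 = c("A", "B", "A"),
                 V3 = c(1L, 1L, 1L))

# := updates dt by reference and returns the whole table (invisibly),
# not just the rows where V1 == "a"
res <- dt[V1 == "a", V3 := 2L]

nrow(res)  # same as nrow(dt), so a chained [...] filters over all rows again
```

So the second [...] in the chain starts from the full table, which is why the b/A row gets overwritten as well.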
Question: What would be the best data table way to go here? I'm using packageVersion("data.table") ‘1.9.7’.
Note that I could go the data frame route e.g. like this
df <- as.data.frame(dt)
for (x in seq(regexes)) {
idx <- df$V1==names(regexes)[x]
df$V3[idx] <- 2
df$V3[idx][grepl(regexes[x], df$V2[idx])] <- 3 # or ifelse()
}
But - of course - I would not want to convert the data.table to a data.frame and then back to a data.table if possible.
Thanks in advance!
dt[condition, blah := boo] returns the modified full dt (something you can check by adding an extra empty [] at the end), thus the result you get. Add the extra condition to the 2nd set of [].

With V1==names(regexes)[x] & grepl(regexes[x], V2) as a condition (i.e. match each regex over all V2s), it takes ~35 seconds, whereas the data frame example takes ~3. So I thought that is clearly not the way to go here.

... V1, grepl still runs several times over 300,000 rows, not over a few subsets of maybe 200 to 2000 rows, like in the data frame example. Still think a bigger example would be beneficial here? In that case I could try and build an artificial one.
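One possible way to keep grepl on the V1 subset only, as a rough sketch rather than a benchmarked answer: put the regex test inside j, so it is evaluated only on the rows selected by i. The toy data is rebuilt with data.table() and integer values for the demo, and the loop variable nm is just an illustrative name. Whether this is fast enough on the real 300,000-row data is untested here.

```r
library(data.table)

regexes <- c(a = "^A$")

# Toy data from the question, built directly instead of via fread (demo assumption)
dt <- data.table(V1 = c("a", "a", "b"),
                 V2 = c("A", "B", "A"),
                 V3 = c(1L, 1L, 1L))

for (nm in names(regexes)) {
  # i selects the rows for this name; j is evaluated on that subset only,
  # so grepl sees just the V2 values where V1 == nm
  dt[V1 == nm, V3 := ifelse(grepl(regexes[[nm]], V2), 3L, 2L)]
}

dt[]  # print the updated table
```

On the toy data this leaves the b/A row untouched, because it never enters the subset that grepl is applied to.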