r data.table: Subsetting and assignment by reference in a for loop

Question

Seems like an easy one, but... well...

Given a named vector of regular expressions and a data table as follows:

library(data.table)
regexes <- c(a="^A$") 
dt <- fread("
a,A,1
a,B,1
b,A,1
")

The input data table is

dt
#    V1 V2 V3
# 1:  a  A  1
# 2:  a  B  1
# 3:  b  A  1

My goal for the 1st element in regexes would be:

If V1=="a" set V3:=2. EXCEPT when V2 matches the corresponding regular expression ^A$, then V3:=3.

(a is names(regexes)[1], ^A$ is regexes[1], and 2 and 3 are just for demo purposes. I also have more names and regular expressions to loop over, and the data set has about 300,000 rows.)

So the expected output is

#    V1 V2 V3
# 1:  a  A  3 (*)
# 2:  a  B  2 (**)
# 3:  b  A  1

(*) 3 because V1 is a and V2 (A) matches the regex,
(**) 2 because V1 is a and V2 (B) does not match ^A$.

I tried to loop through the regexes and pipe the subsetting through like this:

for (x in seq(regexes)) 
  dt[V1==names(regexes)[x], V3:=2][grepl(regexes[x], V2), V3:=3]

However...

dt
#    V1 V2 V3
# 1:  a  A  3 
# 2:  a  B  2
# 3:  b  A  3 <- wrong, should remain 1

... it does not work as expected: grepl uses the complete V2 column, not just the V1=="a" subset. I also tried some other things, which worked but took too long (i.e. not the way to use data.table).

Question: What would be the best data table way to go here? I'm using packageVersion("data.table") ‘1.9.7’.

Note that I could go the data frame route e.g. like this

df <- as.data.frame(dt)
for (x in seq(regexes)) {
  idx <- df$V1==names(regexes)[x]
  df$V3[idx] <- 2
  df$V3[idx][grepl(regexes[x], df$V2[idx])] <- 3 # or ifelse()
}

But - of course - I would not want to convert the data.table to a data.frame and then back to a data.table if possible.

Thanks in advance!

dt[condition, blah := boo] returns the modified full dt (something you can check by adding an extra empty [] at the end), hence the result you get. Add the extra condition to the 2nd set of [].
– eddi, Commented Jun 28, 2016 at 19:55
I tried that, @eddi. However, when I use V1==names(regexes)[x] & grepl(regexes[x], V2) as a condition (i.e. match each regex over all V2s), it takes ~35 seconds, whereas the data frame example takes ~3. So I thought that is clearly not the way to go here.
– lukeA, Commented Jun 28, 2016 at 20:02
Something seems fishy about that. Can you add a larger example where the slowdown can be seen?
– eddi, Commented Jun 28, 2016 at 20:15
To me it sounds reasonable. Although data.table might automatically build an index for V1, grepl still runs several times over 300,000 rows, not over a few subsets of maybe 200 to 2000 rows, like in the data frame example. Do you still think a bigger example would be beneficial here? In that case I could try to build an artificial one.
– lukeA, Commented Jun 28, 2016 at 20:29
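
For readers following the comments above, here is a minimal sketch of one way to keep grepl restricted to the matching rows: call it inside j instead of in a second chained []. Within j, V2 only contains the rows selected by i, so each regex is tested on its own subset. The names nm and re are helpers introduced here for illustration, and the table is built with data.table() rather than fread() only to keep the snippet self-contained. Integer literals are used on the assumption that V3 is an integer column, as fread would read it.

```r
library(data.table)

regexes <- c(a = "^A$")
dt <- data.table(V1 = c("a", "a", "b"),
                 V2 = c("A", "B", "A"),
                 V3 = 1L)

for (x in seq_along(regexes)) {
  nm <- names(regexes)[x]  # value of V1 that selects the subset
  re <- regexes[[x]]       # regex belonging to that subset
  # i selects the subset; in j, V2 is already restricted to those rows
  dt[V1 == nm, V3 := ifelse(grepl(re, V2), 3L, 2L)]
}

dt
#    V1 V2 V3
# 1:  a  A  3
# 2:  a  B  2
# 3:  b  A  1
```

Whether this is fast enough on 300,000 rows is not something this toy example can show; it only illustrates the scoping.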

Frank · Accepted Answer · 2016-06-28 19:46:58Z

... it does not work as expected, grepl uses the complete V2 column, not just the V1=="a" subset.

I would use stringi, which allows for easy vectorization of regex tests:

library(stringi)
dt[V1 %in% names(regexes), 
  V3 := V3 + 1L + stri_detect(V2, regex = regexes[V1])
]

   V1 V2 V3
1:  a  A  3
2:  a  B  2
3:  b  A  1

The stri_detect family of functions works like grepl from base R, but is vectorized over both the strings and the patterns.
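
To make the row-by-row pairing explicit, here is a small sketch with a second, made-up pattern (b = "^B$" is an assumption added for illustration and is not part of the question). Indexing the named vector with regexes[V1] yields one pattern per row, and stri_detect_regex then tests each V2 value only against the pattern that belongs to its own V1.

```r
library(data.table)
library(stringi)

# Second pattern "b" is hypothetical, added only to show the pairing
regexes <- c(a = "^A$", b = "^B$")
dt <- data.table(V1 = c("a", "a", "b", "b"),
                 V2 = c("A", "B", "A", "B"),
                 V3 = 1L)

# regexes[V1] has one entry per selected row: "^A$", "^A$", "^B$", "^B$"
dt[V1 %in% names(regexes),
   V3 := V3 + 1L + stri_detect_regex(V2, regexes[V1])]

dt
#    V1 V2 V3
# 1:  a  A  3
# 2:  a  B  2
# 3:  b  A  2
# 4:  b  B  3
```

Row 3 stays at 2 because its V2 value "A" is checked against "^B$", the pattern named b, and not against "^A$".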

7 Comments

lukeA Over a year ago

Thanks Frank, didn't know about stri_detect, yet. However, names(regexes) and regexes are tied together in my real data, and regexes is in fact not numeric (I feared the example could be misleading, but.. well..). I still think the best way to conquer it would be to (1.) subset each names(regex), (2.) set a default value and then (3.) set a new value for the matches. Let me know if I should work out a better example...

Frank Over a year ago

@lukeA I don't follow. This way does not assume that regexes are numeric (whatever that might mean)... If there's something to do with the size of your example that makes this way slow, though, yeah, a better example (as a function of n) would be good.

lukeA Over a year ago

I mean V3, the target column, is not numeric. I thought I (1) set it to a default value for the subsets (= names(regexes)), and then (2) to a flag value for the regex matches in that subset. Key here is, that each subset has its own regex. So you can't vectorize over the regexes.

Frank Over a year ago

@lukeA You can totally vectorize over regexes. If your regexes object contained b in addition to a, stringi would work just fine, and much more efficiently than if grouping. If you need string values for V3 ("3", "2", "1") instead of numeric, that's a simple change. You can change your fread call to make your example have V3 string and I'll update the answer to show what I mean.

lukeA Over a year ago

Thanks for your patience, and I think I got your point. However, to my knowledge vectorization would only work if all regexes applied to all subsets. In my case, each value of the regexes vector should only be checked within the subset given by its name. You see what I mean?
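
For comparison, the "one regex per subset" idea from this thread can also be sketched with base grepl and grouping. This is only an illustration, not something proposed in the answer: the labels "default", "match" and "no match", the second pattern b = "^B$", and the character V3 column are all assumptions made up for the example. With by = V1, the grouping column has length 1 inside j, so regexes[[V1]] picks the single regex for the current group.

```r
library(data.table)

# Hypothetical patterns and labels, for illustration only
regexes <- c(a = "^A$", b = "^B$")
dt <- data.table(V1 = c("a", "a", "b", "c"),
                 V2 = c("A", "B", "A", "A"),
                 V3 = "default")

# One grepl call per group; V2 holds only that group's rows
dt[V1 %in% names(regexes),
   V3 := ifelse(grepl(regexes[[V1]], V2), "match", "no match"),
   by = V1]

dt
#    V1 V2       V3
# 1:  a  A    match
# 2:  a  B no match
# 3:  b  A no match
# 4:  c  A  default
```

Rows whose V1 has no entry in regexes are left untouched. Frank's comment above suggests the vectorized stringi call should be more efficient than grouping, so this variant is mainly useful when staying in base R matters.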
