
Seems like an easy one, but... well...

Given a named vector of regular expressions and a data table as follows:

library(data.table)
regexes <- c(a="^A$") 
dt <- fread("
a,A,1
a,B,1
b,A,1
")

The input data table is

dt
#    V1 V2 V3
# 1:  a  A  1
# 2:  a  B  1
# 3:  b  A  1

My goal for the 1st element in regexes would be:

If V1=="a" set V3:=2. EXCEPT when V2 matches the corresponding regular expression ^A$, then V3:=3.

(a is names(regexes)[1], ^A$ is regexes[1], 2 and 3 are just for demo purpose. I also got more names and regular expressions to loop over, and the data set is about 300.000 rows.)

So the expected output is

#    V1 V2 V3
# 1:  a  A  3 (*)
# 2:  a  B  2 (**)
# 3:  b  A  1

(*) 3 because V1 is a and V2 (A) matches the regex,
(**) 2 because V1 is a and V2 (B) does not match ^A$.

I tried to loop through the regexes and pipe the subsetting through like this:

for (x in seq(regexes)) 
  dt[V1==names(regexes)[x], V3:=2][grepl(regexes[x], V2), V3:=3]

However...

dt
#    V1 V2 V3
# 1:  a  A  3 
# 2:  a  B  2
# 3:  b  A  3 <- wrong, should remain 1

... it does not work as expected: grepl uses the complete V2 column, not just the V1=="a" subset. I also tried some other things, which worked but took too long (i.e. not the way to use data.table).

Question: What would be the best data table way to go here? I'm using packageVersion("data.table") ‘1.9.7’.


Note that I could go the data frame route e.g. like this

df <- as.data.frame(dt)
for (x in seq(regexes)) {
  idx <- df$V1==names(regexes)[x]
  df$V3[idx] <- 2
  df$V3[idx][grepl(regexes[x], df$V2[idx])] <- 3 # or ifelse()
}  

But - of course - I would not want to convert the data.table to a data.frame and then back to a data.table if possible.

Thanks in advance!

  • dt[condition, blah := boo] returns the full modified dt (something you can check by adding an extra empty [] at the end), hence the result you get. Add the extra condition to the second set of []. Commented Jun 28, 2016 at 19:55
  • I tried that, @eddi, however when I use V1==names(regexes)[x] & grepl(regexes[x], V2) as a condition (i.e. match each regex over all V2s), it takes ~35 seconds, whereas the data frame example takes ~3. So I thought that is clearly not the way to go here. Commented Jun 28, 2016 at 20:02
  • Something seems fishy about that. Can you add a larger example where the slowdown can be seen? Commented Jun 28, 2016 at 20:15
  • To me it sounds reasonable. Although data.table might automatically build an index for V1, grepl still runs several times over 300.000 rows, not over a few subsets of maybe 200 to 2000 rows, like in the data frame example. Still think a bigger example would be beneficial here? In that case I could try and build an artificial one. Commented Jun 28, 2016 at 20:29
  • ok, gotcha, you're right, no need for an example Commented Jun 28, 2016 at 20:30
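A minimal sketch of the two options raised in these comments, using the question's sample data (rebuilt with data.table() instead of fread; the names make_dt, nm, dt1 and dt2 are made up for illustration). Variant 1 adds the V1 condition to the second []. Variant 2 moves the regex test into j, which data.table evaluates only on the rows selected by i, so grepl should see just the subset. This has not been timed on the 300.000-row data, so treat the speed benefit as an assumption.

```r
library(data.table)

regexes <- c(a = "^A$")

# Hypothetical helper that rebuilds the sample table from the question
make_dt <- function() {
  data.table(V1 = c("a", "a", "b"), V2 = c("A", "B", "A"), V3 = 1L)
}

# Variant 1: repeat the V1 condition in the second step.
# grepl still scans the whole V2 column on every iteration.
dt1 <- make_dt()
for (x in seq_along(regexes)) {
  nm <- names(regexes)[x]
  dt1[V1 == nm, V3 := 2L]
  dt1[V1 == nm & grepl(regexes[x], V2), V3 := 3L]
}

# Variant 2: subset once in i and decide between 2 and 3 in j.
# Inside j, V2 contains only the rows where V1 matches the name.
dt2 <- make_dt()
for (x in seq_along(regexes)) {
  dt2[V1 == names(regexes)[x],
      V3 := ifelse(grepl(regexes[x], V2), 3L, 2L)]
}

dt1[]
dt2[]
```

Both variants leave the b row untouched, because every assignment is restricted to rows where V1 equals the current name.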

1 Answer


... it does not work as expected, grepl uses the complete V2 column, not just the V1=="a" subset.

I would use stringi, which allows for easy vectorization of regex tests:

library(stringi)
dt[V1 %in% names(regexes), 
  V3 := V3 + 1L + stri_detect(V2, regex = regexes[V1])
]

   V1 V2 V3
1:  a  A  3
2:  a  B  2
3:  b  A  1

The stri_detect family of functions are like grepl from base.
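To illustrate how regexes[V1] pairs each row with the regex stored under that row's V1 name, here is a sketch with a second, hypothetical regex for b, an extra c row, and a character V3 (none of which are in the question's data). It is only meant to show the lookup, not the asker's real setup.

```r
library(data.table)
library(stringi)

# "b" and its pattern are made up for this illustration
regexes <- c(a = "^A$", b = "^B$")

dt <- data.table(
  V1 = c("a", "a", "b", "b", "c"),
  V2 = c("A", "B", "A", "B", "A"),
  V3 = "1"                      # character target column
)

# regexes[V1] has one pattern per selected row, so stri_detect_regex
# tests each V2 value only against the regex named by that row's V1.
dt[V1 %in% names(regexes),
   V3 := ifelse(stri_detect_regex(V2, regexes[V1]), "3", "2")]

dt[]
```

Rows whose V1 has no entry in regexes (here c) are excluded by i and keep their original value.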


7 Comments

Thanks Frank, didn't know about stri_detect, yet. However, names(regexes) and regexes are tied together in my real data, and regexes is in fact not numeric (I feared the example could be misleading, but.. well..). I still think the best way to conquer it would be to (1.) subset each names(regex), (2.) set a default value and then (3.) set a new value for the matches. Let me know if I should work out a better example...
@lukeA I don't follow. This way does not assume that regexes are numeric (whatever that might mean)... If there's something to do with the size of your example that makes this way slow, though, yeah, a better example (as a function of n) would be good.
I mean V3, the target column, is not numeric. I thought I (1) set it to a default value for the subsets (= names(regexes)), and then (2) to a flag value for the regex matches in that subset. Key here is, that each subset has its own regex. So you can't vectorize over the regexes.
@lukeA You can totally vectorize over regexes. If your regexes object contained b in addition to a, stringi would work just fine, and much more efficiently than if grouping. If you need string values for V3 ("3", "2", "1") instead of numeric, that's a simple change. You can change your fread call to make your example have V3 string and I'll update the answer to show what I mean.
Thanks for your patience, and I think I got your point. However, to my knowledge vectorization would only work if all regexes applied to all subsets. In my case, each value of the regexes vector should only be checked within the subset given by its name. You see what I mean?
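For comparison, the "one regex per subset" idea from this thread can also be written with grouping. This is a sketch under assumed data (the b regex and the c row are hypothetical), and the answerer suggests the vectorized stringi call is more efficient than grouping, so it is shown mainly to make the per-subset logic explicit.

```r
library(data.table)

# "b" and its pattern are made up for this illustration
regexes <- c(a = "^A$", b = "^B$")

dt <- data.table(
  V1 = c("a", "a", "b", "b", "c"),
  V2 = c("A", "B", "A", "B", "A"),
  V3 = 1L
)

# With by = V1, j runs once per group; .BY$V1 is that group's name,
# so each regex is applied only to the V2 values of its own subset.
dt[V1 %in% names(regexes),
   V3 := ifelse(grepl(regexes[[.BY$V1]], V2), 3L, 2L),
   by = V1]

dt[]
```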