R data.table: detect pattern of values within each group

Question

Say I have a data.table like this:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c"), each=5), date = rep(1:5,3), value = sample(c(95:105,""),15, replace=TRUE))

Within each group, in the value column, I would like to check (in a simple way) whether there is a "" (empty character), or a run of empty characters, that is both preceded and followed by a value.

So, this is fine: "", 95, 103, etc. (the empty character is first within the group), but the patterns below are examples of "missing data" that I would like to detect:

95, "", 103, ... (empty character in the middle)

95, "", "", 103, ... (several empty characters in the middle)

95, 103, "" (empty character at the end)

So, in the output below, I would be able to get the row/group A, and if there are many groups, I should get all groups (or rows)

    group date value
 1:     a    1   105
 2:     a    2   103
 3:     a    3   104
 4:     a    4      
 5:     a    5   101
 6:     b    1   102
 7:     b    2   100
 8:     b    3   101
 9:     b    4    97
10:     b    5   102
11:     c    1   104
12:     c    2   101
13:     c    3   104
14:     c    4    96
15:     c    5   102

Edit: What I need to do is select the rows that have the wrong pattern (empty string(s) in the middle or at the end), so that I can detect whether there are any errors in a large dataset. In the table in my example, the desired output would be the 4th row, as it has a "missing value" (an empty character in between values):

   group date value
1:     a    4

(If there were more unwanted rows, of course, I would like to get all of them)

What is your desired output, the rows that meet the criteria or the ones that don't?
– Edward, Commented Apr 1, 2020 at 3:24
@Edward The ones that do (i.e. I want to check whether there are gaps in a large dataset, and ideally there would be zero)
– Djpengo, Commented Apr 1, 2020 at 7:13
@arg0naut91 4th row? Otherwise just use a different seed number...
– Djpengo, Commented Apr 1, 2020 at 7:22

emirhan · Accepted Answer · 2020-04-01 12:55:12Z

1

If your data.table is not sorted by the 'date' column, you can use the following:

DT[order(date), order := seq_len(.N), by = group]
DT[value == "" & order > 1L]

output:

   group date value order
1:     a    4           4

data is the same as yours:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c"), each=5), date = rep(1:5,3), 
                 value = sample(c(95:105,""),15, replace=TRUE))

answered Apr 1, 2020 at 12:55

emirhan


3 Comments

Djpengo Over a year ago

I don't think this would work when a group has more than one row with an empty string at the beginning (e.g. the 1st, 2nd, and 3rd rows are all empty strings)?

emirhan Over a year ago

It will work. You can test with the following example: DT <- data.table(group = rep(c("a","b"), each=3), date = c(1,2,3), value = c("","","","95","","100"))

emirhan Over a year ago

The answer basically returns all the rows that have an empty string as 'value', except the first date of each group.
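
The comment exchange above can be checked directly. Below is a minimal sketch, assuming the example data from the comment (with `date` written out explicitly). The names `DT2`, `pos`, `seen`, `by_position` and `after_value` are hypothetical, and the `cumsum()` variant is only one possible way to skip a whole leading run of empty strings; it is not part of the original answer.

```r
library(data.table)

# Example data from the comment above, with date written out explicitly.
DT2 <- data.table(
  group = rep(c("a", "b"), each = 3),
  date  = rep(1:3, 2),
  value = c("", "", "", "95", "", "100")
)

# Position-based filter from the answer: every empty string after the first
# date of a group is returned, including the leading ones in group "a".
DT2[order(date), pos := seq_len(.N), by = group]
by_position <- DT2[value == "" & pos > 1L]

# Hypothetical alternative: mark rows at or after the first non-empty value
# (in date order), then keep only the empty strings among them.
DT2[order(date), seen := cumsum(value != "") > 0L, by = group]
after_value <- DT2[value == "" & seen]

print(by_position)
print(after_value)
```

On this data the position-based filter returns the 2nd and 3rd rows of group a as well as the 2nd row of group b, while the `cumsum()` variant returns only the 2nd row of group b. Which of the two is wanted depends on whether a leading run of several empty strings should count as a gap.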

chinsoon12 · Accepted Answer · 2020-04-01 09:12:48Z

0

Here is an option:

DT[, rw := rleid(value==""), group]
DT[value=="" & rw>1L]

output:

   group date value rw
1:     a    4        2

data:

library(data.table)
set.seed(10)
DT <- data.table(group = rep(c("a","b","c","d"), each=5), 
    date = rep(1:5,4), value = c(sample(c(95:105,""),15, replace=TRUE), c("",2,3,4,5)))
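
To make the role of `rleid()` easier to see, here is a small sketch on hand-written data rather than a random sample. The table `toy` and the name `flagged` are hypothetical and not taken from the question; the filter itself is the one shown in this answer.

```r
library(data.table)

# Hypothetical toy data (not from the question): group "x" starts with two
# empty strings, group "y" has an empty string in the middle and at the end.
toy <- data.table(
  group = rep(c("x", "y"), each = 5),
  date  = rep(1:5, 2),
  value = c("", "", "97", "98", "99",
            "95", "", "103", "104", "")
)

# rleid() numbers consecutive runs of identical values, restarting per group.
toy[, rw := rleid(value == ""), by = group]

# Empty strings in the first run (rw == 1) are leading gaps and are left out;
# empty strings in any later run come after at least one value.
flagged <- toy[value == "" & rw > 1L]
print(flagged)
```

Here group x gets run ids 1, 1, 2, 2, 2, so its two leading empty strings are not flagged, while group y gets 1, 2, 3, 3, 4, so its rows at date 2 and date 5 are returned. The sketch assumes the rows are already in date order within each group.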

edited Apr 1, 2020 at 9:12

answered Apr 1, 2020 at 2:45

chinsoon12


7 Comments

Djpengo Over a year ago

Thanks, but what I would like to do is select the rows that have the wrong pattern (empty string(s) in the middle), in order to detect whether there are any errors in a large dataset.

chinsoon12 Over a year ago

can you post your desired output?

Djpengo Over a year ago

In the DT you created, it would be: 1: c 5 (the 15th row of the DT table, which has an empty string as the last value within the group). In the table in my example, it would be the 4th row, as it has an empty character in between values.

chinsoon12 Over a year ago

Post as in: type the desired output into your post, as the wording is quite vague.

chinsoon12 Over a year ago

I have also updated my post. The previous answer was based on "I would be able to get the row/group A, and if there are many groups, I should get all groups (or rows)", which made it seem like you wanted the whole group.
