How to delete columns from a data.table based on values in column

Question

Background

I have some financial data (1.5 years of SP500 stocks) that I have reshaped into a wide format using the data.table package. After following the whole data.table course on Datacamp, I'm starting to get the hang of the basics, but after searching for hours I'm at a loss on how to do this.

The Problem

The data contains columns with financial data for each stock. I need to delete columns that contain two consecutive NAs.

My guess is that I have to use rle() and lapply() to find consecutive values, and DT[, x := NULL] to delete the columns.

I read that rle() doesn't work on NAs, so I changed them to Inf instead. I just don't know how to combine the functions so that I can efficiently remove a few columns among the 460 that I have.

An answer using data.table would be great, but anything that works well is very much appreciated.

Alternatively, I would love to know how to remove columns containing at least one NA.

Example data

> test[1:5,1:5,with=FALSE]
         date     10104     10107     10138     10145
1: 2012-07-02  0.003199       Inf  0.001112 -0.012178
2: 2012-07-03  0.005873  0.006545  0.001428       Inf
3: 2012-07-05       Inf -0.001951 -0.011090       Inf
4: 2012-07-06       Inf -0.016775 -0.009612       Inf
5: 2012-07-09 -0.002742 -0.006129 -0.001294  0.005830
> dim(test)
[1] 377 461

Desired outcome

         date     10107     10138
1: 2012-07-02       Inf  0.001112
2: 2012-07-03  0.006545  0.001428
3: 2012-07-05 -0.001951 -0.011090
4: 2012-07-06 -0.016775 -0.009612
5: 2012-07-09 -0.006129 -0.001294

PS. This is my first question and I have tried to follow the rules. If I need to change anything, please let me know.

eddi · Accepted Answer · 2015-06-15 21:55:22Z

4

Here's an rle version:

dt[, sapply(dt, function(x)
       setDT(rle(is.na(x)))[, sum(lengths > 1 & values) == 0]), with = FALSE]

Or replace the is.na with is.infinite if you like.
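To make the run-length idea concrete, here is a small base R sketch on a made-up vector (the vector `x` and the name `has_consecutive_na` are illustrative assumptions, not part of the answer). Because rle() is called on the logical result of is.na(x) rather than on the raw values, the NAs themselves never reach rle().

```r
# Toy column: one isolated NA, then two NAs in a row.
x <- c(0.1, NA, 0.3, NA, NA, 0.6)

# Run-length encode the missingness indicator, not the values themselves.
runs <- rle(is.na(x))
runs$lengths  # 1 1 1 2 1
runs$values   # FALSE TRUE FALSE TRUE FALSE

# The column should be dropped when any run of NAs is longer than one element.
has_consecutive_na <- any(runs$values & runs$lengths > 1)
has_consecutive_na  # TRUE
```

The answer's `sum(lengths > 1 & values) == 0` is the negation of this flag, so it is TRUE for the columns to keep.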


SKG · Accepted Answer · 2015-06-15 19:58:50Z

2

To detect and delete columns containing at least one NA, you can try the following:

data = data.frame(A=c(1,2,3,4,5), B=c(2,3,4,NA,6), C=c(3,4,5,6,7), D=c(4,5,NA,NA,8))

colsToDelete = lapply(data, FUN = function(x){ sum(is.na(x)) >= 1 })

data.formatted = data[,c(!unlist(colsToDelete))]
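Since the question asks about data.table, the same "at least one NA" filter might look roughly like this. It is a sketch that reuses the toy data above as a data.table; the names `dt`, `complete_cols` and `dt_complete` are assumptions made for illustration.

```r
library(data.table)

dt <- data.table(A = c(1, 2, 3, 4, 5), B = c(2, 3, 4, NA, 6),
                 C = c(3, 4, 5, 6, 7), D = c(4, 5, NA, NA, 8))

# Names of the columns that contain no NA at all.
complete_cols <- names(dt)[!sapply(dt, anyNA)]

# Select by name; .SDcols avoids the need for with = FALSE.
dt_complete <- dt[, .SD, .SDcols = complete_cols]
names(dt_complete)  # "A" "C"
```

anyNA() stops at the first NA it finds, which can matter when checking a few hundred columns.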


10 Comments

blakeoft Over a year ago

If I understand correctly, multiple NA entries are acceptable, however, consecutive NA entries are not.

SKG Over a year ago

I was trying to answer the alternate part of the question "Alternatively I would love to know how to remove columns containing at least 1 NA"

David Arenburg Over a year ago

You do realize this is a data.table question, right? I can't find any data.table syntax around here...

Cristopher van der Kooij Over a year ago

@DavidArenburg, @blakeoft, I'm new to R (writing my thesis) and I did learn data.table to reshape the data. I'm happy with the base R solution because I am pressed for time, although I would love to know if there is a data.table solution. I'll edit the question accordingly.

David Arenburg Over a year ago

@SKG thank you for clearing that out for me. Very new to data.table.


Aaron Katch · Accepted Answer · 2015-06-15 19:45:09Z

1

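The core of the problem is finding consecutive missing values. First, create a TRUE/FALSE matrix marking which entries are NA. Then compare each row of that matrix with the next one by multiplying the two offset copies; the product is nonzero only where both rows are missing. Finally, keep the columns of the original data whose column sum of that product is 0.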

Try this:

Missing.Mat <- apply(test, 2, is.na)
Consecutive.Mat <- Missing.Mat[-nrow(Missing.Mat),] * Missing.Mat[-1,]
Keep.Cols <- colSums(Consecutive.Mat) == 0

test[,Keep.Cols]
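The offset multiplication may be easier to follow on a single column. This is a minimal sketch on a made-up indicator vector (`miss` and `overlap` are illustrative names, not part of the answer):

```r
# Toy missingness indicator for one column: NA in rows 2, 4 and 5.
miss <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)

# Pair each row with the next one; the product is 1 only where both are missing.
overlap <- miss[-length(miss)] * miss[-1]
overlap       # 0 0 0 1 0
sum(overlap)  # 1, so this column would be dropped
```

In the matrix version above, the same product is computed for every column at once and colSums() does the per-column summing.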


9 Comments

Aaron Katch Over a year ago

To delete columns with at least one NA, use Keep.Cols <- colSums(Missing.Mat) == 0 instead.

Cristopher van der Kooij Over a year ago

Thanks Aaron, this is great. Because I use the data.table package, the last line has to be test[, Keep.Cols, with = FALSE]. I was totally missing the point by focusing only on data.table. Do I understand correctly that you build Consecutive.Mat by multiplying offset versions of Missing.Mat, so that two consecutive NAs would "overlap"?

Aaron Katch Over a year ago

That's right. Does this solution still work for you?

David Arenburg Over a year ago

I think a bit more data.tableish version would be something like

indx <- setDT(df1)[, names(Filter(isTRUE, lapply(.SD, function(x) sum(x[-1L] == x[-.N]) == 0)))] ; df1[, indx, with = FALSE]

David Arenburg Over a year ago

@StevenBeaupré it's just a slight modification of this answer as I simply can't trace any data.table syntax around here.


blakeoft · Accepted Answer · 2015-06-15 19:58:08Z

This is what I came up with. It calls rle on a vector y that is 1:length(column), except that wherever the column is Inf the corresponding value in y is set to zero. It then checks whether any run is longer than 1; since every finite entry keeps a unique index, only consecutive Inf values can form such a run.

keep <- c(date = TRUE, apply(dat[, -1], 2,
              function(x) {
                y <- seq_along(x)
                y[!is.finite(x)] <- 0
                !any(rle(y)$lengths > 1)
              }))

dat2 <- dat[, keep]
dat2
#         date    X10107    X10138
# 1 2012-07-02       Inf  0.001112
# 2 2012-07-03  0.006545  0.001428
# 3 2012-07-05 -0.001951 -0.011090
# 4 2012-07-06 -0.016775 -0.009612
# 5 2012-07-09 -0.006129 -0.001294

Note that the column names are prepended with an "X" by read.table.

Now, the dput of the data:

dat <- structure(list(date = c("2012-07-02", "2012-07-03", "2012-07-05", 
"2012-07-06", "2012-07-09"), X10104 = c(0.003199, 0.005873, Inf, 
Inf, -0.002742), X10107 = c(Inf, 0.006545, -0.001951, -0.016775, 
-0.006129), X10138 = c(0.001112, 0.001428, -0.01109, -0.009612, 
-0.001294), X10145 = c(-0.012178, Inf, Inf, Inf, 0.00583)), .Names = c("date", 
"X10104", "X10107", "X10138", "X10145"), class = "data.frame", row.names = c(NA, 
-5L))
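For completeness, the question's original idea of deleting columns with `:= NULL` could be sketched on the same sample data as follows. This assumes the data.table package is installed; `has_consecutive_gap`, `value_cols` and `drop_cols` are hypothetical names introduced only for this example, and is.finite() is used so that both Inf and NA count as gaps.

```r
library(data.table)

# Same values as the dput above, built directly as a data.table.
dt <- data.table(
  date   = c("2012-07-02", "2012-07-03", "2012-07-05", "2012-07-06", "2012-07-09"),
  X10104 = c(0.003199, 0.005873, Inf, Inf, -0.002742),
  X10107 = c(Inf, 0.006545, -0.001951, -0.016775, -0.006129),
  X10138 = c(0.001112, 0.001428, -0.01109, -0.009612, -0.001294),
  X10145 = c(-0.012178, Inf, Inf, Inf, 0.00583)
)

# Hypothetical helper: TRUE if x has two or more non-finite values in a row.
has_consecutive_gap <- function(x) {
  r <- rle(!is.finite(x))
  any(r$values & r$lengths > 1)
}

# Check every column except the date column.
value_cols <- setdiff(names(dt), "date")
flagged    <- vapply(value_cols, function(col) has_consecutive_gap(dt[[col]]),
                     logical(1))
drop_cols  <- value_cols[flagged]

# Delete the flagged columns by reference (the table is modified in place).
if (length(drop_cols) > 0) dt[, (drop_cols) := NULL]
names(dt)  # "date" "X10107" "X10138"
```

Unlike the subsetting approaches, this modifies dt itself, so take a copy() first if the original table is still needed.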
