
Background

I have some financial data (1.5 years of S&P 500 stocks) that I have reshaped into a wide format using the data.table package. After following the whole data.table course on Datacamp, I'm starting to get the hang of the basics, but after searching for hours I'm at a loss on how to do this.

The Problem

The data contains columns with financial data for each stock. I need to delete columns that contain two consecutive NAs.

My guess is that I have to use rle() with lapply() to find consecutive values, and DT[, x := NULL] to delete the columns.

I read that rle() doesn't work on NAs, so I changed them to Inf instead. I just don't know how to combine the functions so that I can efficiently remove a few columns among the 460 that I have.

An answer using data.table would be great, but anything that works well is very much appreciated.

Alternatively, I would love to know how to remove columns containing at least one NA.

Example data

> test[1:5,1:5,with=FALSE]
         date     10104     10107     10138     10145
1: 2012-07-02  0.003199       Inf  0.001112 -0.012178
2: 2012-07-03  0.005873  0.006545  0.001428       Inf
3: 2012-07-05       Inf -0.001951 -0.011090       Inf
4: 2012-07-06       Inf -0.016775 -0.009612       Inf
5: 2012-07-09 -0.002742 -0.006129 -0.001294  0.005830
> dim(test)
[1] 377 461

Desired outcome

         date     10107     10138
1: 2012-07-02       Inf  0.001112
2: 2012-07-03  0.006545  0.001428
3: 2012-07-05 -0.001951 -0.011090
4: 2012-07-06 -0.016775 -0.009612
5: 2012-07-09 -0.006129 -0.001294

PS. This is my first question, I have tried to adhere to the rules, if I need to change anything please let me know.

4 Answers


Here's an rle version:

# Keep only the columns that have no run of two or more consecutive NAs
dt[, sapply(dt, function(x)
       setDT(rle(is.na(x)))[, sum(lengths > 1 & values) == 0]), with = FALSE]

Or replace is.na with is.infinite if you have recoded the NAs as Inf, as in the question.
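As a hedged illustration of the same rle idea on the question's sample data, here is a base R sketch. rle() treats every NA as its own run of length 1, which is why the check runs on a logical "is missing" vector rather than on the raw values. The names dat, has_consecutive_missing, and to_drop are assumptions made for this example, and both NA and Inf are counted as missing.

```r
# Sample data from the question, with Inf marking missing values
dat <- data.frame(
  date   = c("2012-07-02", "2012-07-03", "2012-07-05", "2012-07-06", "2012-07-09"),
  X10104 = c(0.003199, 0.005873, Inf, Inf, -0.002742),
  X10107 = c(Inf, 0.006545, -0.001951, -0.016775, -0.006129),
  X10138 = c(0.001112, 0.001428, -0.011090, -0.009612, -0.001294),
  X10145 = c(-0.012178, Inf, Inf, Inf, 0.005830)
)

# TRUE if a column has at least two missing values in a row
# (hypothetical helper; treats both NA and Inf as missing)
has_consecutive_missing <- function(x) {
  if (!is.numeric(x)) return(FALSE)        # never drop non-numeric columns such as date
  runs <- rle(is.na(x) | is.infinite(x))   # runs of missing / not missing
  any(runs$values & runs$lengths > 1)      # a "missing" run longer than 1
}

to_drop <- vapply(dat, has_consecutive_missing, logical(1))
result  <- dat[, !to_drop, drop = FALSE]
names(result)
```

On this sample, X10104 and X10145 are dropped, while X10107 stays because its single Inf is not followed by another one.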


To detect and delete columns containing at least one NA, you can try the following:

data = data.frame(A=c(1,2,3,4,5), B=c(2,3,4,NA,6), C=c(3,4,5,6,7), D=c(4,5,NA,NA,8))

colsToDelete = lapply(data, FUN = function(x){ sum(is.na(x)) >= 1 })

data.formatted = data[,c(!unlist(colsToDelete))]
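Since the question asks for data.table, here is a minimal sketch of the same "at least one NA" filter in data.table syntax. It assumes the data.table package is installed. The names dt, keep, and clean are illustrative.

```r
library(data.table)

# Same toy data as above, as a data.table
dt <- data.table(A = c(1, 2, 3, 4, 5),
                 B = c(2, 3, 4, NA, 6),
                 C = c(3, 4, 5, 6, 7),
                 D = c(4, 5, NA, NA, 8))

# Names of the columns that contain no NA at all
keep <- names(dt)[!vapply(dt, anyNA, logical(1))]

# Option 1: return a copy with only the NA-free columns
clean <- dt[, keep, with = FALSE]

# Option 2 (commented out): delete the other columns in place, by reference
# dt[, (setdiff(names(dt), keep)) := NULL]

names(clean)
```

Option 1 leaves dt untouched, which is convenient while experimenting. Option 2 avoids copying, which may matter more as the table grows.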

10 Comments

If I understand correctly, multiple NA entries are acceptable, however, consecutive NA entries are not.
I was trying to answer the alternate part of the question "Alternatively I would love to know how to remove columns containing at least 1 NA"
You do realize this is a data.table question, right? I can't find any data.table syntax around here...
@DavidArenburg, @blakeoft, I'm new to R (writing my thesis) and I did learn data.table to reshape the data. I'm happy with the base R solution because I am pressed for time, although I would love to know if there is a data.table solution. I'll edit the question accordingly.
@SKG thank you for clearing that out for me. Very new to data.table.

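The core of the problem is finding consecutive missing values. First, build a TRUE/FALSE matrix that marks the missing entries. Then compare each row with the next one by multiplying the matrix by a copy of itself offset by one row; the product is 1 only where both rows are missing. Finally, keep the columns of the original data whose column sums in that product are 0.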

Try this:

Missing.Mat <- apply(test, 2, is.na)
Consecutive.Mat <- Missing.Mat[-nrow(Missing.Mat),] * Missing.Mat[-1,]
Keep.Cols <- colSums(Consecutive.Mat) == 0

test[, Keep.Cols]  # if test is a data.table, use test[, Keep.Cols, with = FALSE]
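To make the offset comparison concrete, here is a small sketch on a toy matrix in base R. The names m, missing_mat, above, below, and keep_cols are made up for this illustration. It uses & instead of multiplication, which gives the same result on logical matrices.

```r
# Toy data: column b has two consecutive NAs, column c has two isolated NAs
m <- cbind(a = c(1, 2, 3, 4),
           b = c(1, NA, NA, 4),
           c = c(NA, 2, NA, 4))

missing_mat <- is.na(m)                                   # TRUE where a value is missing
above <- missing_mat[-nrow(missing_mat), , drop = FALSE]  # rows 1 to n-1
below <- missing_mat[-1, , drop = FALSE]                  # rows 2 to n

# TRUE only where a row and the row after it are both missing
consecutive <- above & below

keep_cols <- colSums(consecutive) == 0
keep_cols
```

Column c survives because its NAs are never adjacent, while column b is flagged. If missing values are coded as Inf, as in the question, is.infinite() would take the place of is.na().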

9 Comments

To delete columns with at least one NA, use Keep.Cols <- colSums(Missing.Mat) == 0 instead.
Thanks Aaron, this is great. Because I use the data.table package, the last line has to be test[,Keep.Cols, with=FALSE]. I was totally missing the point by focussing only on data.table. Do I understand correctly that you make Consecutive.Mat by multiplying offset versions of Missing.Mat, so that two consecutive missing values "overlap"?
That's right. Does this solution still work for you?
I think a bit more data.tableish version would be something like setDT(df1)[, names(Filter(isTRUE, lapply(.SD, function(x) sum(x[-1L] == x[-.N]) == 0)))] ; df1[, names(indx), with = FALSE]
@StevenBeaupré it's just a slight modification of this answer as I simply can't trace any data.table syntax around here.

This is what I came up with. It calls rle on a vector y that is 1:length(column), except that y is set to zero wherever the corresponding element of the column is not finite (Inf or NA). Since all other values of y are distinct, only the zeros can form a run, so the column is kept when no run is longer than 1.

keep <- c(date = TRUE, apply(dat[, -1], 2,
              function(x) {
                y <- 1:length(x)
                y[!is.finite(x)] <- 0
                return(!any(rle(y)$lengths > 1))
              }))

dat2 <- dat[, keep]
dat2
#         date    X10107    X10138
# 1 2012-07-02       Inf  0.001112
# 2 2012-07-03  0.006545  0.001428
# 3 2012-07-05 -0.001951 -0.011090
# 4 2012-07-06 -0.016775 -0.009612
# 5 2012-07-09 -0.006129 -0.001294

Note that the column names are prepended with an "X" by read.table.

Now, the dput of the data:

dat <- structure(list(date = c("2012-07-02", "2012-07-03", "2012-07-05", 
"2012-07-06", "2012-07-09"), X10104 = c(0.003199, 0.005873, Inf, 
Inf, -0.002742), X10107 = c(Inf, 0.006545, -0.001951, -0.016775, 
-0.006129), X10138 = c(0.001112, 0.001428, -0.01109, -0.009612, 
-0.001294), X10145 = c(-0.012178, Inf, Inf, Inf, 0.00583)), .Names = c("date", 
"X10104", "X10107", "X10138", "X10145"), class = "data.frame", row.names = c(NA, 
-5L))
