
All of the code included in this question is from the script called "LASSO code (Version for Antony)" in my GitHub repo for this project. You can run it on the folder called "last 40" to verify my claim that it does run on limited-size datasets. And if you really feel like going the extra mile, message me here and I'll share a zipped folder of 10k datasets via OneDrive or Google Drive (whichever you prefer, lad) so you can also verify that the same script doesn't work on folders of that volume.

This is absolutely going to drive me mad, I swear. I have been using the parLapply call below without issue for a week now, and starting several hours ago, it is giving me this error:

> datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
Error in checkForRemoteErrors(val) : 
  7 nodes produced errors; first error: could not find function "fread" 

Here is the rest of the script I am working with up to this line (after the lines I use to load all of the libraries I utilize):

# these 2 lines together create a character vector of
# all the file names in the folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/12th & 13th 10k"
paths_list <- list.files(path = folderpath, full.names = T, recursive = T)

# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)


# sort both lists of file names so that they are in the proper order
my_order = DS_names_list |> 
  # split apart the numbers, convert them to numeric 
  strsplit(split = "-", fixed = TRUE) |>  unlist() |> as.numeric() |>
  # get them in a data frame
  matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
  # get the appropriate ordering to sort the data frame
  do.call(order, args = _)

DS_names_list = DS_names_list[my_order]
paths_list = paths_list[my_order]
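The sorting pipeline above can be checked in isolation. Here is a minimal sketch using made-up file names of the form "<chunk>-<number>" (hypothetical; the real names in the repo may differ), showing that the split/matrix/order chain sorts numerically rather than lexicographically (note the base-pipe placeholder `_` requires R >= 4.2):

```r
# Hypothetical file names; "12-10" should sort AFTER "12-2" numerically,
# even though it sorts before it alphabetically.
DS_names_list <- c("12-10", "12-2", "13-1", "12-1")

my_order <- DS_names_list |>
  # split apart the numbers and convert them to numeric
  strsplit(split = "-", fixed = TRUE) |> unlist() |> as.numeric() |>
  # arrange them in a data frame, one row per file name
  matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
  # order by the first number, breaking ties with the second
  do.call(order, args = _)

DS_names_list[my_order]
#> [1] "12-1"  "12-2"  "12-10" "13-1"
```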

# this line reads all of the data in each of the csv files
# using each name stored in the list we just created
CL <- makeCluster(detectCores() - 2L)
clusterExport(CL, c('paths_list'))
library(data.table)
system.time( datasets <- parLapply(CL, paths_list, fread) )

After looking up the documentation for the 3rd time today, I am thinking of trying:

system.time( datasets <- parLapply(CL, paths_list, fun = fread) )

Will that work??

p.s. Here is all of the libraries I load as the first thing I do:

# load all necessary packages
library(plyr)
library(dplyr)
library(tidyverse)
library(readr)
library(stringi)
library(purrr)
library(stats)
library(leaps)
library(lars)
library(elasticnet)
library(data.table)
library(parallel)

Also, I have already tried the following and none worked:

datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
datasets <- parLapply(CL, paths_list, function(i) {fread[i]})
datasets <- parLapply(CL, paths_list, function(i) {fread[[i]]})

datasets <- parLapply(CL, paths_list, \(ds) 
                      {fread(ds)})

system.time( datasets <- lapply(paths_list, fread) )

And when I run that last one, datasets <- lapply(paths_list, fread), I get the same error. That was exactly the version that ran successfully at the beginning of last week; I only switched to the parallel version because the folder I am importing/loading has 260,000 csv-formatted datasets in it. So, two versions which have worked dozens of times already just stopped working suddenly today!
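For context on the error message itself: PSOCK workers created by makeCluster() start as fresh R sessions, so packages attached with library() in the main session are not attached on the workers, which is the usual cause of "could not find function" errors there. A minimal sketch using only base-R packages (parallel and tools, so no installation assumptions) shows that a namespace-qualified function resolves on the workers without any library() call:

```r
library(parallel)

# two fresh worker sessions; they have NOT run library(tools)
cl <- makeCluster(2L)

paths <- c("a/b/one.csv", "a/b/two.csv")

# namespace-qualified call: resolved on each worker via the tools
# namespace, so no attach step is needed on the workers
out <- parLapply(cl, paths, function(p) tools::file_path_sans_ext(basename(p)))

stopCluster(cl)
unlist(out)
#> [1] "one" "two"
```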

Comments:
  • What happens if you qualify the function with its package (i.e., replace fread() with data.table::fread())? Commented Jan 8, 2023 at 17:45
  • Any chance it's related to this? stackoverflow.com/q/18035711/1082435 Commented Jan 8, 2023 at 17:47
  • I'm losing track a little of the versions. I misspoke: I think it should be data.table::fread (without the parentheses). Commented Jan 8, 2023 at 18:40
  • If you make your example reproducible (e.g., don't use datasets in a local directory), it will be easier for others to experiment on their own machines. But I know your example is tougher than most for this. Commented Jan 8, 2023 at 18:42
  • @wibeasley good point in your last comment. I was literally seeing red when typing this question originally, so I forgot to add a link to my GitHub repository with the code, which has a subset of only 40 datasets you can run for yourself to verify that it does run just fine. Commented Jan 8, 2023 at 18:46

1 Answer

See if this works consistently. It hasn't failed yet on my Windows desktop with 20k files (I copied and pasted your 40 files a bunch). I've run it 5 times, restarting the R session and RStudio each time.

It's too bad that the problem arises non-deterministically, but that's part of the parallel-computation game. See if this stripped-down example runs consistently.

Notice I'm avoiding library() to eliminate naming collisions caused by packages with identically named functions. Also, I closed the cluster connection at the end.

# Enumerate files
paths_list <- 
  "~/Documents/delete-me/EER-Research-Project-main/20k" |> 
  list.files(full.names = T, recursive = T)

# Establish cluster
CL <- parallel::makeCluster(parallel::detectCores() - 2L)
parallel::clusterExport(CL, c('paths_list'))

# Read files
system.time({
  datasets <- parallel::parLapply(CL, paths_list, data.table::fread)
})

# Stop cluster
parallel::stopCluster(CL)

#>    user  system elapsed 
#>    7.09    1.22  101.93 

5 Comments

WOW, maybe there is a god and he doesn't hold me being a ginger against me after all! It works again, hallelujah. Do you have a PayPal or a Venmo, kind sir? I am broke, but I'd like to send you a few bucks.
That's a nice offer; instead, take some grad student to lunch after you graduate. I'm glad it works on your Windows machine. I didn't test it on any of mine, and I remember there are backend differences between parallel's implementation on Windows and Linux.
Actually, I appear to have spoken too soon, although I assure you this is not your fault. It ran on 10, 40, 100, and 1,000 datasets, but not on 10,000.
Bummer. I simplified things in response. See the edited version.
@wibeasley it is running again now; I was able to get it to work using mainly your suggestions and a bit of tinkering. The final version now is this: CL <- makeCluster(detectCores() - 3L); clusterExport(CL, c('paths_list')); system.time( datasets <- parLapply(cl = CL, X = paths_list, fun = data.table::fread) ). But I am going to add in the parallel:: suggestions as well right now, because I have been pushed to the point of paranoia by circumstances!
