I have this sample R scraper script (I can't share the actual website):

#!/usr/bin/Rscript

library(RCurl)
library(httr)
library(rvest)
library(lubridate)
library(stringi)

new_files <- Map(function(ln, y, bn) {

  fun1 <- html_session(URLencode(
    paste0("https://example.com", ln)),
    config(ssl_verifypeer = FALSE))

  # Save the response to disk only when the file is dated today
  if (y == Sys.Date()) {
    writeBin(fun1$response$content, bn)
  } else {
    "He's dead, Jim"
  }

  return(fun1$response$content)

}, links, dates, names)

I'm running this script in a Docker container, through Apache NiFi (the ExecuteProcess processor). But when I set it to run, I keep getting this error:

Process execution failed due to java.io.IOException: Stream closed: java.io.IOException: Stream closed
     java.io.IOException: Stream closed 
  at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
  at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
  at java.io.FilterInputStream.read(FilterInputStream.java:107)
  at org.apache.nifi.processors.standard.ExecuteProcess$4.call(ExecuteProcess.java:367)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

I was reading this answer on streams being closed before they should be. I have no idea why this would throw a "Stream closed" exception when the script works fine on my local computer and in RStudio.

It breaks as soon as it's executed in a Docker container. Is it something to do with the if/else statement inside the Map function, or with loading the lubridate package? I have no clue.

  • Does the R script execute as expected when invoked with the same arguments that make the Java call fail? Can you post the code that actually kicks off the script from Java? Commented Feb 24, 2019 at 19:56
  • What is the full docker command you execute? Are you sure it is not going into background? Commented Feb 25, 2019 at 8:40
  • @DavidSoroko I added the shebang that kicks off the script, if that's what you mean. If not, please correct me Commented Feb 25, 2019 at 15:06
  • @ChristophBauer when I run Docker, I do a regular build and then this: docker run -p 8080:8080 -d nifi-container-name Commented Feb 25, 2019 at 15:08
  • OK. You will have to narrow down the problem. The script runs in RStudio? Can you run the script from the command line? Can you run the Docker container successfully from the command line? What are your properties in the NiFi processor? If you work through these in order, you should be able to narrow down the problem. Commented Feb 26, 2019 at 8:54

1 Answer

As several people already mentioned, you are trying to do something complex that would need troubleshooting in multiple areas. I will share some steps to approach this, but please consider the following:

You are using quite a complex solution for what might be a simple problem. Can you frame your problem in one of these ways: "I want to scrape a website" or "I want to run a script"?

In that case there is good news: NiFi can easily work with scripts using the ExecuteScript processor. It currently supports these languages:

  • Clojure
  • ECMAScript
  • Groovy
  • Lua
  • Python
  • Ruby

Based on my personal preference I would choose Python; you will easily find lots of examples of how to scrape websites.
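As a rough illustration of that route, here is a minimal standalone Python 3 sketch of the download logic from the question: fetch each link and write it to disk only when its date is today. It uses only the standard library. The names download_new_files, fetch and fetcher are made up for this example, the safe= characters only approximate what R's URLencode leaves alone, and unlike the R script it does not disable SSL certificate verification. It is meant to be run as an ordinary script; the Python engine inside ExecuteScript is Jython, so the body would need adapting before it could run there.

```python
import datetime
import urllib.parse
import urllib.request

BASE_URL = "https://example.com"  # placeholder host taken from the question


def fetch(url):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_new_files(links, dates, names, today=None, fetcher=fetch):
    """Fetch every link and save only the ones dated today.

    Returns the list of file names that were written (the R version
    returns the downloaded content instead).
    """
    today = today or datetime.date.today()
    written = []
    for link, day, name in zip(links, dates, names):
        # Percent-encode spaces and similar characters in the path.
        url = BASE_URL + urllib.parse.quote(link, safe="/?=&")
        content = fetcher(url)
        if day == today:
            with open(name, "wb") as handle:
                handle.write(content)
            written.append(name)
    return written
```

Passing the download function in as fetcher is a design choice for this sketch: it lets you try the date and file handling with a fake downloader before pointing it at a real site.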


In case the above is not sufficient, please check the following steps:

  1. Does your script work? (Seems like you already checked this)
  2. Are you able to run a trivial R script from NiFi? (e.g. something that does 1+1 without needing any libraries)
  3. Are you able to run any R script from your docker container without NiFi?
  4. Are you able to run this specific R script from your docker container without NiFi?
  5. Are you able to do anything at all with the ExecuteProcess processor? For example a simple ls
  6. Are you able to do anything at all with the ExecuteProcess processor in that Docker container? For example a simple ls
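The checks that do not involve NiFi can be scripted. Below is a minimal Python 3 sketch that runs each command and records its exit code, stdout and stderr, so that an error raised by R itself (a missing package in the container, for instance) would be visible instead of hidden behind a Java exception. The helper run_check, the container name my-nifi and the script path /path/to/scraper.R are hypothetical placeholders; it assumes docker and Rscript are on the PATH and that the container is already running.

```python
import subprocess


def run_check(label, command):
    """Run one command and report its exit status, stdout and stderr."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=120
        )
    except (OSError, subprocess.TimeoutExpired) as error:
        # Covers a missing executable (e.g. no Rscript on PATH) or a hang.
        return {"label": label, "ok": False, "stdout": "", "stderr": str(error)}
    return {
        "label": label,
        "ok": result.returncode == 0,
        "stdout": result.stdout.strip(),
        "stderr": result.stderr.strip(),
    }


# Hypothetical names: replace with your own container and script path.
CONTAINER = "my-nifi"
SCRIPT = "/path/to/scraper.R"

CHECKS = [
    ("trivial R on this machine", ["Rscript", "-e", "cat(1 + 1)"]),
    ("trivial R in the container",
     ["docker", "exec", CONTAINER, "Rscript", "-e", "cat(1 + 1)"]),
    ("the scraper in the container",
     ["docker", "exec", CONTAINER, "Rscript", SCRIPT]),
    ("a plain ls in the container", ["docker", "exec", CONTAINER, "ls"]),
]

if __name__ == "__main__":
    for label, command in CHECKS:
        outcome = run_check(label, command)
        status = "OK" if outcome["ok"] else "FAILED"
        print(f"{status}: {label}")
        if not outcome["ok"]:
            print(outcome["stderr"])
```

The first check that fails tells you which layer to look at: R itself, the container, or only the combination with NiFi.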

It would be a bit too much to dive into all the possibilities, but do work through these checks; hopefully the answer becomes clear, or at least the troubleshooting can be more focussed.
