I am working on a Scrapy spider that extracts the text from multiple PDF files in a directory using slate. I have no interest in saving the actual PDFs to disk, so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams.

However, I'm not sure how to pass the PDF body to BytesIO and then pass the in-memory PDF to slate to get the text. So far I have:

import urlparse
from io import BytesIO

from scrapy.http import Request
from scrapy.spider import BaseSpider


class Ove_Spider(BaseSpider):

    name = "ove"

    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']

    def parse(self, response):
        # Follow every link on the page that points to a PDF
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                # Resolve relative links against the page URL
                link = urlparse.urljoin(response.url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        in_memory_pdf = BytesIO()
        in_memory_pdf.read(response.body) # Trying to read in PDF which is in response body

I'm getting:

in_memory_pdf.read(response.body)
TypeError: integer argument expected, got 'str'

How can I get this working?

1 Answer

in_memory_pdf.read(response.body) fails because read() expects the number of bytes to read, not the data itself. You want to initialize the buffer with the data, not read from an empty one.

In Python 2, just initialize BytesIO as:

 in_memory_pdf = BytesIO(response.body)
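
To make the difference between filling the buffer and reading from it concrete, here is a minimal sketch. The `body` value is a made-up stand-in for `response.body`, not real PDF data.

```python
from io import BytesIO

# Hypothetical stand-in for response.body: a few bytes shaped like the start of a PDF.
body = b"%PDF-1.4\n%fake content for illustration\n"

# Passing the bytes to the constructor fills the buffer in one step.
in_memory_pdf = BytesIO(body)

# read(n) takes a byte count and returns up to n bytes from the current position.
header = in_memory_pdf.read(8)

# seek(0) rewinds the buffer, so whatever reads it next sees the whole file.
in_memory_pdf.seek(0)
everything = in_memory_pdf.read()
```

After the constructor call, `in_memory_pdf` behaves like a file opened in binary mode, so it can be handed to any code that expects a file object.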

In Python 3, you cannot initialize BytesIO with a string because it expects bytes. If response.body is of type str, as the error message suggests, we have to encode it first:

 in_memory_pdf = BytesIO(bytes(response.body,'ascii'))

But since a PDF is binary data, I would expect response.body to be bytes, not str (Scrapy exposes the response body as bytes on Python 3). In that case, the simple in_memory_pdf = BytesIO(response.body) works.
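
If you would rather not care which type you get, a small helper can handle both cases. This is only a sketch: `body_to_buffer` and `extract_from_body` are hypothetical names, and the `latin-1` default is an assumption. It maps code points 0 to 255 straight to single bytes, whereas `'ascii'` raises an error on anything above 127, but it only round-trips correctly if the text was decoded the same way.

```python
from io import BytesIO


def body_to_buffer(body, encoding="latin-1"):
    """Return a BytesIO over body, encoding it first if it is text.

    Hypothetical helper; the latin-1 default is an assumption.
    """
    if not isinstance(body, bytes):
        body = body.encode(encoding)
    return BytesIO(body)


def extract_from_body(body, extractor):
    """Call extractor with an in-memory file built from body.

    extractor is any callable that accepts a file-like object.
    """
    return extractor(body_to_buffer(body))
```

Inside the spider callback this would be called as `extract_from_body(response.body, slate.PDF)`, assuming `slate.PDF` accepts any file-like object in place of an opened file.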
