Creating bytesIO object

Question

I am working on a Scrapy spider, trying to extract the text from multiple PDF files in a directory using slate. I have no interest in saving the actual PDFs to disk, so I've been advised to look into the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams.

However, I'm not sure how to pass the PDF body to the BytesIO class and then pass the in-memory PDF to slate to get the text. So far I have:

import urlparse
from io import BytesIO

from scrapy.http import Request
from scrapy.spider import BaseSpider


class Ove_Spider(BaseSpider):

    name = "ove"
    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']

    def parse(self, response):
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                # Resolve relative links against the URL of the current page
                link = urlparse.urljoin(response.url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        in_memory_pdf = BytesIO()
        in_memory_pdf.read(response.body)  # Trying to read in PDF which is in response body

I'm getting:

in_memory_pdf.read(response.body)
TypeError: integer argument expected, got 'str'

How can I get this working?

Jean-François Fabre · Accepted Answer · 2016-09-30 19:55:25Z

read() expects the number of bytes to read as its argument, not the data itself, which is why in_memory_pdf.read(response.body) fails. You want to initialize the buffer with the data, not read from it.
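The difference between reading and initializing can be seen in a small standalone sketch. The sample bytes are a made-up stand-in for response.body, and the snippet assumes Python 3:

```python
from io import BytesIO

# Hypothetical stand-in for response.body; any bytes value behaves the same.
body = b"%PDF-1.4 sample payload"

# read() takes an optional byte count, so passing data raises TypeError.
empty = BytesIO()
try:
    empty.read(body)
    read_failed = False
except TypeError:
    read_failed = True

# Passing the data to the constructor fills the buffer instead.
in_memory_pdf = BytesIO(body)
header = in_memory_pdf.read(8)  # the first 8 bytes
rest = in_memory_pdf.read()     # everything after the current position
```

After construction the buffer behaves like a file opened in binary mode, positioned at the start.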

In Python 2, just initialize BytesIO as:

 in_memory_pdf = BytesIO(response.body)

In Python 3, you cannot use BytesIO with a string because it expects bytes. The error message shows that response.body is of type str: we have to encode it.

 in_memory_pdf = BytesIO(bytes(response.body,'ascii'))

However, since a PDF is binary data, response.body is most likely already bytes rather than str. In that case, the simple in_memory_pdf = BytesIO(response.body) works.
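If you are unsure which type you will receive, one way to cover both cases is a small helper along these lines. This is only a sketch: body_to_buffer is a hypothetical name, and the latin-1 default is an assumption chosen because it maps code points 0-255 one-to-one onto bytes, whereas ascii rejects anything above 127.

```python
from io import BytesIO


def body_to_buffer(body, encoding="latin-1"):
    """Wrap body in a BytesIO, encoding it first if it is text.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(body, bytes):
        # Already binary data: hand it to the constructor unchanged.
        return BytesIO(body)
    # Text input: convert to bytes before building the buffer.
    return BytesIO(body.encode(encoding))


in_memory_pdf = body_to_buffer(b"%PDF-1.4")
```

The returned buffer is a file-like object, so it can be passed to any parser that accepts an open binary file.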

edited Sep 30, 2016 at 19:55

answered Sep 30, 2016 at 19:48
