Create a partial pdf from bytes in python

Question

I have a PDF file somewhere. This PDF is being sent to the destination in equal-sized chunks of bytes (apart from the last chunk).

Let's say this PDF file is being read like this in Python:

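import asyncio

async def send_pdf(filename):  # wrapper name is illustrative; `await` needs an async function
    with open(filename, 'rb') as file:
        chunk = file.read(3000)
        while chunk:
            # the sending method here
            await asyncio.sleep(0.5)
            chunk = file.read(3000)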

The question is: can I construct a partial PDF file in the destination, while the leftover part of the document is being sent?

I tried it with pypdfium2 / PyPDF2, but they throw errors until the whole PDF file has arrived:

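import io
import pypdfium2

class Receiver:  # class name is illustrative; the original shows only the body
    full_pdf = b''

    def process(self, message):
        # Append the new chunk and try to parse everything received so far
        self.full_pdf += message
        partial = io.BytesIO(self.full_pdf)
        try:
            pdf = pypdfium2.PdfDocument(partial)
            print(len(pdf))
        except Exception as e:
            print("error", e)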

Basically, I'd like to get the pages of the document, even if the whole document hasn't arrived yet.

"Can I construct a partial PDF file in the destination, while the leftover part of the document is being sent?" I doubt it. The PDF format is a) not designed for streaming; b) not designed for extracting useful information; c) hideously complex.
– Karl Knechtel, Commented Feb 10, 2023 at 12:48
@KarlKnechtel, I understand that. What I'm thinking of is along the lines of: manually discovering where each page starts and ends, then creating a new PDF from the content that has arrived so far (like, manually closing it for the time being).
– Patrick Visi, Commented Feb 10, 2023 at 12:53
I don't think there is necessarily anything in the file contents that indicates a page start/end, though I could be wrong. If you can cut the incoming stream at a page boundary that would probably be the basis for a solution though, yes.
– Karl Knechtel, Commented Feb 10, 2023 at 12:54
Incremental loading is possible for linearized PDFs ("fast web view"). pypdfium2 has no helpers for that yet, but you can theoretically do it with the raw API (namespace pypdfium2.raw). Take a look at pdfium/public/fpdf_dataavail.h if you're interested. Making use of these APIs will require some deeper knowledge of C/Python/ctypes, though.
– mara004, Commented May 7, 2023 at 12:56
However, pypdfium2 (and others like pikepdf) can already work with buffer input, but then you don't get download hints. If you want to go that route, note that you must continuously download in the background (not just on seeks) and cache downloaded data to achieve reasonable performance.
– mara004, Commented May 7, 2023 at 13:10

Martin Thoma · Accepted Answer · 2023-05-07 15:20:53Z

It's not possible to stream an arbitrary PDF and do anything useful with it before the whole file is present.

According to the PDF 1.7 standard, the structure is:

A one-line header identifying the version of the PDF specification to which the file conforms
A body containing the objects that make up the document contained in the file
A cross-reference table containing information about the indirect objects in the file
A trailer giving the location of the cross-reference table and of certain special objects within the body of the file

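The problem is that the cross-reference (x-ref) table and the trailer are at the end of the file. A reader starts from the trailer to find the cross-reference table, which holds the byte offsets of the objects in the body, so it cannot reliably locate pages until the last bytes have arrived.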
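To illustrate why the end of the file matters, here is a minimal sketch using only the standard library. It looks for the `startxref` pointer that a classic PDF carries just before its final `%%EOF`. The helper name `find_startxref`, the 16-byte chunk size, and the hand-written byte string are assumptions made for illustration; the byte string is not a valid PDF.

```python
import re
from typing import Optional

# Tail of a classic PDF: "startxref", a byte offset, then "%%EOF".
_TAIL = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def find_startxref(data: bytes) -> Optional[int]:
    """Return the xref offset announced at the end of `data`, or None if the tail is missing."""
    match = _TAIL.search(data[-1024:])  # the pointer sits in the last few bytes
    return int(match.group(1)) if match else None


# Hypothetical hand-written fragment for illustration only; not a valid PDF.
fake_pdf = (
    b"%PDF-1.7\n...objects...\nxref\n...\ntrailer\n<< /Size 1 >>\n"
    b"startxref\n23\n%%EOF\n"
)

received = b""
results = []
for start in range(0, len(fake_pdf), 16):  # 16-byte chunks stand in for the network
    received += fake_pdf[start:start + 16]
    results.append(find_startxref(received))

print(results)
```

In this sketch the helper returns None for every prefix and only yields an offset once the final chunk is appended, which mirrors why the parsers in the question raise errors until the transfer completes.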

PDF Linearization: "fast web view"

The above part is true for arbitrary PDFs. However, it's possible to create so-called "linearized PDF files" (also called "fast web view"). Those files re-order the internal structure of PDF files to make them streamable.
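A receiver can make a rough guess at whether an incoming file is linearized from its first chunk alone, because the PDF specification places the linearization parameter dictionary (the one with the `/Linearized` key) within the first 1024 bytes. The sketch below is a heuristic under that assumption, not a validator; the helper name `looks_linearized` and the two sample headers are made up for illustration.

```python
import re

# The linearization dictionary carries a /Linearized key followed by a version number.
_LINEARIZED = re.compile(rb"/Linearized\s+[\d.]+")


def looks_linearized(head: bytes) -> bool:
    """Heuristic: True if the first 1024 bytes contain a /Linearized entry."""
    return _LINEARIZED.search(head[:1024]) is not None


# Hypothetical hand-written headers for illustration; not complete PDFs.
linear_head = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 5000 /N 3 >>\nendobj\n"
plain_head = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_linearized(linear_head), looks_linearized(plain_head))
```

A positive result only says the file claims to be linearized; a file that was modified after linearization may still carry the marker, so a real reader would need to verify the rest of the structure.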

At the moment, pypdf==3.4.0 does not support PDF linearization.

pikepdf claims to support that:

import pikepdf  # pip install pikepdf

with pikepdf.open("input.pdf") as pdf:
    pdf.save("out.pdf", linearize=True)

If you can generate linearized PDF documents, others might be able to stream them partially.

edited May 7, 2023 at 15:20

answered Feb 10, 2023 at 12:55


8 Comments

Patrick Visi Over a year ago

As much as I hate this, I do appreciate the answer. Why PDFs are constructed this way is a mystery.

Martin Thoma Over a year ago

PDF is pretty old. Why do you need this in the first place?

Patrick Visi Over a year ago

Word mail-merged documents would create, let's say, 20,000 pages. These pages should be sent to another place from an Office.js add-in. Since the Office.js Word add-in does not provide an API to create PDFs page by page, the only solution would have been to send the data chunk by chunk and reconstruct it somewhere else.

Martin Thoma Over a year ago

As long as you don't need to do anything with the partial file ... sending alone is no problem. You just need to re-construct it completely before you read it again

Patrick Visi Over a year ago

That's the problem: I have other things to do with the pages that have already arrived. That's why I wanted to reconstruct the partial PDF.
