I have a PDF file somewhere. This PDF is being sent to the destination in chunks of equal size (apart from the last chunk).

Let's say this PDF file is read like this in Python:

import asyncio

async def send_file(filename):
    with open(filename, 'rb') as file:
        chunk = file.read(3000)
        while chunk:
            # the sending method here
            await asyncio.sleep(0.5)
            chunk = file.read(3000)

The question is: can I construct a partial PDF file at the destination while the rest of the document is still being sent?

I tried it with pypdfium2 / PyPDF2, but they throw errors until the whole PDF file has arrived:

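import io
import pypdfium2

class Receiver:  # class name is a placeholder; the original snippet omitted it
    full_pdf = b''

    def process(self, message):
        # Append the new chunk and try to parse everything received so far
        self.full_pdf += message
        partial = io.BytesIO(self.full_pdf)
        try:
            pdf = pypdfium2.PdfDocument(partial)
            print(len(pdf))
        except Exception as e:
            print("error", e)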

Basically, I'd like to get the pages of the document that have already arrived, even if the document is not yet complete.

  • "Can I construct a partial PDF file in the destination, while the leftover part of the document is being sent?" I doubt it. The PDF format is a) not designed for streaming; b) not designed for extracting useful information; c) hideously complex. Commented Feb 10, 2023 at 12:48
  • @KarlKnechtel, I understand that. What I'm thinking of is along the lines of: manually discovering where each page starts and ends, then creating a new PDF from the part that has already arrived (like, manually closing it for the time being). Commented Feb 10, 2023 at 12:53
  • I don't think there is necessarily anything in the file contents that indicates a page start/end, though I could be wrong. If you can cut the incoming stream at a page boundary that would probably be the basis for a solution though, yes. Commented Feb 10, 2023 at 12:54
  • Incremental loading is possible for linearized PDFs ("fast web view"). pypdfium2 has no helpers for that yet, but you can theoretically do it with the raw API (namespace pypdfium2.raw). Take a look at pdfium/public/fpdf_dataavail.h if you're interested. Making use of these APIs will require some deeper knowledge of C/Python/ctypes, though. Commented May 7, 2023 at 12:56
  • However, pypdfium2 (and others like pikepdf) can already work with buffer input, but then you don't get download hints. If you want to go that route, note that you must continuously download in background (not just on seeks) and cache downloaded data to achieve reasonable performance. Commented May 7, 2023 at 13:10

1 Answer

For an arbitrary PDF, it's not possible to stream the file and do anything useful with it before the whole file is present.

According to the PDF 1.7 standard, the structure is:

  1. A one-line header identifying the version of the PDF specification to which the file conforms
  2. A body containing the objects that make up the document contained in the file
  3. A cross-reference table containing information about the indirect objects in the file
  4. A trailer giving the location of the cross-reference table and of certain special objects within the body of the file

The problem is that the cross-reference table and the trailer are at the end of the file, and a reader needs them to locate the objects in the body.
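To make this concrete, here is a minimal sketch that checks whether the trailer has arrived in a buffer of received bytes. The helper name `find_startxref` and the tiny hand-written sample are assumptions for illustration only; the sample is not a usable PDF, it just has the same overall layout (header, body, cross-reference table, trailer).

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024):
    """Return the cross-reference offset announced by the trailer,
    or None if the end of the file has not arrived yet.
    (Hypothetical helper, not part of any PDF library.)"""
    tail = data[-tail_size:]
    # A complete file ends with: startxref <offset> %%EOF
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        return None
    return int(matches[-1])

# Hand-written stand-in with the four-part layout, for demonstration only.
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)
complete = (
    body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(xref_offset).encode() + b"\n%%EOF\n"
)

print(find_startxref(complete[:30]))  # prefix only: prints None
print(find_startxref(complete))       # whole file: prints the xref offset
```

Until the last chunk is in, a check like this returns `None` for every prefix, which is why parsers that start from the trailer reject the partial buffer.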

PDF Linearization: "fast web view"

The above part is true for arbitrary PDFs. However, it's possible to create so-called "linearized PDF files" (also called "fast web view"). Those files re-order the internal structure of PDF files to make them streamable.

At the moment, pypdf==3.4.0 does not support PDF linearization.

pikepdf claims to support that:

import pikepdf  # pip install pikepdf

with pikepdf.open("input.pdf") as pdf:
    pdf.save("out.pdf", linearize=True)

If you can generate linearized PDF documents, a reader that supports linearization might be able to process them while they are still being streamed.
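On the receiving side, a rough way to tell whether an incoming file claims to be linearized is to look at its first bytes: the PDF specification requires the linearization parameter dictionary (the one carrying the `/Linearized` key) to sit within the first 1024 bytes. The sketch below is a heuristic under that assumption; `looks_linearized` and the sample byte strings are made up for illustration, and a match does not guarantee that the file is still validly linearized (for example after an incremental update).

```python
import re

def looks_linearized(prefix: bytes) -> bool:
    """Heuristic: True if a /Linearized key appears in the first 1024 bytes.
    (Hypothetical helper, not part of pikepdf, pypdf or pypdfium2.)"""
    head = prefix[:1024]
    return head.startswith(b"%PDF-") and re.search(rb"/Linearized\b", head) is not None

# Hand-written sample prefixes, for demonstration only.
linearized_head = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 12345 >>\nendobj\n"
plain_head = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_linearized(linearized_head))  # True
print(looks_linearized(plain_head))       # False
```

With the 3000-byte chunks from the question, the first chunk alone is enough to run this check and decide whether to attempt incremental handling or simply wait for the full file.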

8 Comments

As much as I hate this, I do appreciate the answer. Why PDFs are constructed this way is a mystery.
PDF is pretty old. Why do you need this in the first place?
Word mail-merged documents would create, let's say, 20,000 pages. These pages should be sent to another place from an officejs add-in. Since the officejs Word add-in does not provide an API to create PDFs page by page, the only solution would have been to send the data chunk by chunk and reconstruct it somewhere else.
As long as you don't need to do anything with the partial file, sending alone is no problem. You just need to reconstruct it completely before you read it again.
That's the problem: I have other things to do with the pages that have already arrived. That's why I wanted to reconstruct the partial PDF.