Python Image extraction sequence from pdf

Question

I am trying to extract images from a PDF using PyMuPDF (fitz). My PDF has multiple images on a single page, and I number the images sequentially as I save them. However, the images are not extracted in a consistent order: sometimes extraction starts from the bottom of the page, sometimes from the top, and so on. Is there a way to modify my code so that the extraction follows the order in which the images appear on the page? This is the code I am using:

import fitz  # PyMuPDF

filename = "document.pdf"
doc = fitz.open(filename)

# Note: newer PyMuPDF releases rename getPageImageList() to
# get_page_images() and Pixmap.writeImage() to Pixmap.save().
for i in range(len(doc)):
    p_no = i + 1  # 1-based page number used in the file name
    img_num = 0   # running image counter within this page
    for img in doc.getPageImageList(i):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        img_num += 1
        if pix.n - pix.alpha >= 4:
            # CMYK image: convert to RGB before saving
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.writeImage("%s-%s.jpg" % (p_no, img_num))
        pix = None  # release the pixmap

Given below is a sample page of the pdf

Nope. Doesn't work. I get the following error: TypeError: 'NoneType' object is not iterable
– Sabster, commented Dec 2, 2020 at 20:29

Ecko · Accepted Answer · 2020-12-05 13:31:44Z

I have the same problem. I used the following code:

import io

import fitz  # PyMuPDF
from PIL import Image

file = "file_path"
pdf_file = fitz.open(file)

# Note: newer PyMuPDF releases rename getImageList() to get_images()
# and extractImage() to extract_image().
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # print the number of images found on this page
    if image_list:
        print(f"[+] Found {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on the given pdf page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        print(img)
        print(image_index)
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; PIL picks the format from the extension
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

The most likely fix is to find where each 'img' entry sits on the page and sort the entries by that position. I'd love to hear any further suggestions, or whether you have found a better idea or solution.
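A rough sketch of that idea, not a tested solution: it assumes a reasonably recent PyMuPDF in which page.get_image_info(xrefs=True) returns one dict per displayed image with "xref" and "bbox" keys, and in which page coordinates start at the top-left corner with y growing downward. The helper names reading_order and save_images_in_reading_order, and the 5-point row tolerance, are made up for illustration.

```python
def reading_order(placements, row_tolerance=5.0):
    """Sort (xref, (x0, y0, x1, y1)) pairs top-to-bottom, then left-to-right.

    Images whose top edges lie within row_tolerance points of the first
    image in a row are treated as belonging to that row.
    """
    by_top = sorted(placements, key=lambda p: p[1][1])  # sort by y0
    rows = []
    for item in by_top:
        y0 = item[1][1]
        if rows and abs(y0 - rows[-1][0][1][1]) <= row_tolerance:
            rows[-1].append(item)   # same row as the previous image
        else:
            rows.append([item])     # start a new row
    # within each row, order by the left edge x0
    return [item for row in rows for item in sorted(row, key=lambda p: p[1][0])]


def save_images_in_reading_order(pdf_path):
    """Hypothetical driver: save each page's images in visual order."""
    import fitz  # PyMuPDF; imported here so reading_order() works without it

    doc = fitz.open(pdf_path)
    for page_index, page in enumerate(doc):
        placements = []
        for info in page.get_image_info(xrefs=True):
            if info["xref"] > 0:  # skip inline images, which have no xref
                placements.append((info["xref"], tuple(info["bbox"])))
        ordered = reading_order(placements)
        for image_index, (xref, _bbox) in enumerate(ordered, start=1):
            base_image = doc.extract_image(xref)
            name = f"image{page_index + 1}_{image_index}.{base_image['ext']}"
            with open(name, "wb") as out:
                out.write(base_image["image"])
```

Calling save_images_in_reading_order("file_path") would then number the files by position on the page instead of by the order in which the PDF happens to list the image objects. The tolerance probably needs tuning per document, and an image drawn more than once on a page would be saved once per placement.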

1 Comment

Anand Over a year ago

I am facing a similar issue :( I am struggling to match the images with the text. While the text can be extracted in order, the images can't, which is a real pain!
