
I am trying to extract images from a PDF using PyMuPDF (fitz). My PDF has multiple images on a single page, and I number the images sequentially as I save them. However, the images are not extracted in a consistent order: sometimes extraction starts from the bottom of the page, sometimes from the top, and so on. Is there a way to modify my code so that the extraction follows a consistent sequence? The code I am using is given below:

import fitz  # PyMuPDF

filename = "document.pdf"
doc = fitz.open(filename)

for i in range(len(doc)):
    p_no = i + 1  # 1-based page number used in the file name
    img_num = 0   # image counter, reset for every page
    # newer PyMuPDF releases rename getPageImageList() to get_page_images()
    for img in doc.getPageImageList(i):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        img_num += 1
        if pix.n - pix.alpha >= 4:
            # CMYK or similar: convert to RGB before saving
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # newer PyMuPDF releases rename writeImage() to save()
        pix.writeImage("%s-%s.jpg" % (p_no, img_num))
        pix = None  # release the pixmap

A sample page of the PDF is shown below.

[Image: snap taken from the PDF]

  • for img in doc.getPageImageList(i).sort() maybe? Commented Dec 2, 2020 at 20:24
  • Nope, that doesn't work. I get the following error: TypeError: 'NoneType' object is not iterable Commented Dec 2, 2020 at 20:29
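
For context on that error: `list.sort()` sorts the list in place and returns `None`, so looping over its return value fails. The built-in `sorted()` returns a new list instead. The small sketch below uses made-up tuples as a stand-in for the image list (the values are assumptions, not taken from the PDF in question). Note that this only orders the entries by xref, which is a PDF object number and does not necessarily say anything about where an image sits on the page.

```python
# Stand-in for doc.getPageImageList(i): tuples whose first item is the xref.
# The values are invented for illustration only.
image_list = [(31, 0, 640, 480), (12, 0, 200, 100), (27, 0, 800, 600)]

in_place_result = image_list.copy().sort()  # sorts the copy in place, returns None
sorted_by_xref = sorted(image_list)         # returns a new list that can be iterated

print(in_place_result)
print([item[0] for item in sorted_by_xref])
```

So `sorted(...)` would avoid the `TypeError`, but it probably would not by itself produce the on-page order being asked about.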

1 Answer


I have the same problem. I've used the following code:

import fitz  # PyMuPDF
import io
from PIL import Image

file = "file_path"
pdf_file = fitz.open(file)

for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    # newer PyMuPDF releases rename getImageList() to get_images()
    image_list = page.getImageList()
    # print the number of images found on this page
    if image_list:
        print(f"[+] Found {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on the given pdf page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        print(img)
        print(image_index)
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes (extract_image() in newer releases)
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; passing a path lets PIL open and close the file
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

The most likely fix is to find where each image (the 'img' entry) is placed on the page and sort the images by that position. I'd love to hear any further suggestions, or whether you have found a better idea or solution.
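
A minimal sketch of that idea, untested against the PDF in question: it assumes a PyMuPDF release that provides the snake_case methods `Page.get_images(full=True)`, `Page.get_image_bbox()` and `Document.extract_image()`. The helper names `reading_order_key`, `images_in_reading_order` and `extract_in_order` are made up for illustration. It sorts by the top-left corner of each image's bounding box, top to bottom and then left to right, relying on PyMuPDF's page coordinates growing downward from the top of the page.

```python
def reading_order_key(entry):
    """Sort key for (xref, (x0, y0, x1, y1)): top-to-bottom, then left-to-right."""
    _xref, (x0, y0, _x1, _y1) = entry
    return (round(y0), round(x0))


def images_in_reading_order(page):
    """Return (xref, bbox) pairs for the images on `page`, topmost first.

    Assumes `page` offers get_images(full=True) and get_image_bbox(item),
    as recent PyMuPDF releases do.
    """
    entries = []
    for item in page.get_images(full=True):
        rect = page.get_image_bbox(item)  # rectangle where the image is drawn
        entries.append((item[0], (rect.x0, rect.y0, rect.x1, rect.y1)))
    return sorted(entries, key=reading_order_key)


def extract_in_order(pdf_path):
    """Save every image of the PDF as image<page>_<n>.<ext>, numbered by position."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    for page_index, page in enumerate(doc, start=1):
        ordered = images_in_reading_order(page)
        for image_index, (xref, _bbox) in enumerate(ordered, start=1):
            base_image = doc.extract_image(xref)
            name = f"image{page_index}_{image_index}.{base_image['ext']}"
            with open(name, "wb") as out:
                out.write(base_image["image"])  # raw bytes, no re-encoding
```

Calling `extract_in_order("file_path")` would then write the files in positional order. Two caveats: an image that is drawn more than once on a page only gets one bounding box this way, and entries that the page lists but never actually displays may need to be filtered out before sorting.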


1 Comment

I am facing a similar issue :( I'm struggling to match the images with the text. While the text can be extracted in order, the images can't, which is a real pain!
