I have 110 PDFs that I'm trying to extract images from. Once the images are extracted, I'd like to remove any duplicates and delete images that are less than 4KB. My code to do that looks like this:

import os
import shutil
import sys
import md5
from glob import glob
from subprocess import call
from multiprocessing import Pool

import pandas as pd
from PIL import Image

def extract_images_from_file(pdf_file):
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)

def dedup_images():
    os.mkdir("unique_images")
    md5_library = []
    images = glob("*.png")
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..."
    for image in images:
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            m = md5.new()
            image_data = list(Image.open(image).getdata())
            image_string = "".join(["".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data])
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5']
    dat = pd.DataFrame(md5_library, columns=headers).sort_values(['md5'])
    dat.drop_duplicates(subset="md5", inplace=True)

    print "Extracting the unique images."
    unique_images = dat.image_file.tolist()
    for image in unique_images:
        old_file = image
        new_file = "unique_images\\" + image
        shutil.copy(old_file, new_file)

This process can take a while, so I've started to dabble in multithreading. Feel free to interpret that as me saying I have no idea what I'm doing. I thought the process would be easily parallelisable with regard to extracting the images, but not deduping since there's a lot of I/O going on with one file and I have no idea how to do that. So here's my attempt at the parallel process:

if __name__ == '__main__':
    filepath = sys.argv[1]
    folder_name = os.getcwd() + "\\all_images\\"
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
    pdfs = glob("*.pdf")
    print "Copying all PDFs to the images folder..."
    for pdf in pdfs:
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images")
    pool = Pool(processes=8)
    print "Extracting images from PDFs..."
    pool.map(extract_images_from_file, pdfs)
    print "Extracting unique images into a new folder..."
    dedup_images()
    print "All images have been extracted and deduped."

Everything seems to have worked fine when extracting the images, but then it all went haywire. So here are my questions:

1) Am I setting up the parallel process correctly?
2) Does it continue to try to use all 8 processors on dedup_images()?
3) Is there anything I'm missing and/or not doing correctly?

Thanks in advance!

EDIT: Here is what I mean by "haywire". The errors start out with a bunch of lines like this:

I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey   of1pi0e
l2ne1  1i'4mS auogbiepl o2fefinrlaee e N@'egSwmu abYipolor ekcn oaCm o Nupentwt  y1Y -o18r16k11 8.C1po4nu gn3't4
y7 5160120821143  3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C
3o-u3l6d0n.'ptn go'p
en image file 'Ia/ ON eEwr rYoorr:k  CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o  uiolmidalng2'eft r m '
ai gpceoo emfn iapl teN  e1'w-S 8uY6bo2pr.okpe nnCgao' u
Nnetwy  Y1o0r2k8 1C4o u3n4t7y9 918181881134  3p4t7 536-1306211.3p npgt'
4-879.png'
I/O Error: CoulId/nO' tE rorpoern:  iCmoaugled nf'itl eo p'eub piomeangae  fNielwe  Y'oSrukb pCooeunnat yN e1w0 2Y8o1r
4k  3C4o7u9n9t8y8 811032 1p1t4  3o-i3l622f pt 1-863.png'

And then gets more readable with multiple lines like this:

I/O Error: Couldn't open image file 'pt 1-864.png'
I/O Error: Couldn't open image file 'pt 1-865.png'
I/O Error: Couldn't open image file 'pt 1-866.png'
I/O Error: Couldn't open image file 'pt 1-867.png'

This repeats for a while, going back and forth between the garbled error text and the readable.

Finally, it gets to here:

Deleting images smaller than 4KB and generating the MD5 hash values for all other images...
Extracting unique images into a new folder...

which implies that the code picks back up and continues on with the process. What could be going wrong?

3 Comments
  • That looks OK to me. Can you be more specific about "went haywire"? Commented Oct 2, 2015 at 15:48
  • @strubbly I've added the error output above. Commented Oct 2, 2015 at 19:29
  • "I've started to dabble in multithreading. Feel free to interpret that as me saying I have no idea what I'm doing" You and everyone else who starts to work with concurrency. Commented Oct 2, 2015 at 22:34

2 Answers

Your code is basically fine.

The garbled text is all of the worker processes writing their own copies of the I/O Error message to the console at the same time, so the lines interleave character by character. The message itself is generated by the pdfimages command, probably because two runs executing at once conflict with each other, possibly over temporary files, or by writing to the same output file names.

Try using a different image root for each separate pdf file.
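One way to sketch that (`unique_root` is a hypothetical helper; any per-call unique suffix, such as a `uuid` fragment, does the job, and the random 3-digit code works the same way):

```python
import os
import uuid
from subprocess import call

def unique_root(pdf_file):
    # Hypothetical helper: derive the image root from the PDF's name,
    # then add a random suffix so no two workers share an output root.
    base = os.path.splitext(os.path.basename(pdf_file))[0]
    return "%s_%s" % (base, uuid.uuid4().hex[:6])

def extract_images_from_file(pdf_file):
    call(["pdfimages", "-png", pdf_file, unique_root(pdf_file)])
    os.remove(pdf_file)
```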

2 Comments

I've accepted this as the answer because it effectively solved the issue I was having. I appended a random 3-digit alphanumeric code to the root name and it completely alleviated any issues. Thanks!
Cool - you're doing fine with the multiprocessing - just bear in mind that the things you call need to be able to run together. They can conflict when they share resources like directories or files.
  1. Yes. Pool.map takes a function of one argument plus an iterable; each element of the iterable is passed to the function in one of the worker processes.
  2. No, because everything you have written here runs in the original process except for the body of extract_images_from_file(). Also, note that you're using 8 processes, not 8 processors. With an 8-core Intel CPU and Hyper-Threading enabled, the OS sees 16 logical processors, so up to 16 processes could actually run at the same time.
  3. It looks fine to me, except that if extract_images_from_file() throws an exception, it will nuke your entire Pool, which is probably not what you want. To prevent this, wrap the body of that function in a try/except.
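A minimal sketch of point 3, assuming extract_images_from_file() from the question (the stand-in below just simulates a failure so the wrapper is self-contained):

```python
def extract_images_from_file(pdf_file):
    # Stand-in for the question's worker; raises for one input
    # to demonstrate the failure path.
    if pdf_file.endswith("bad.pdf"):
        raise IOError("couldn't open %s" % pdf_file)

def safe_extract(pdf_file):
    # Catch any exception inside the worker and report it back to the
    # parent instead of letting it abort the Pool's map() call.
    try:
        extract_images_from_file(pdf_file)
    except Exception as exc:
        return (pdf_file, str(exc))
    return (pdf_file, None)
```

You would then call `pool.map(safe_extract, pdfs)` and inspect the returned pairs for failures.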

What's the nature of the "haywire" you're dealing with? Can we see the exception text?
