I've been trying to get the hang of multithreading in Python. However, whenever I try to use it for something actually useful, I run into issues.
In this case, I have 300 PDF files. For simplicity, we'll assume that each PDF just has a unique number on it (say, 1 to 300). I'm trying to have Python open each file, grab the text from it, and then rename the file based on that text.
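To give an idea of what I mean, the single-threaded version is essentially the sketch below (same logic as the threaded code further down, minus the pool; the path, prefix, and file handling are simplified):

import os
from os import listdir
import PyPDF2

pre = "77"
path = "./pdfPages/"
included_extensions = ['pdf', 'PDF']
allFiles = [f for f in listdir(path) if any(f.endswith(ext) for ext in included_extensions)]

for fileName in allFiles:
    # Open the PDF, pull the text off the first page, then rename the file after it
    with open(path + fileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        page_content = read_pdf.getPage(0).extractText()
    text = str(page_content.encode('utf-8')).strip("b").strip("'")
    os.rename(path + fileName, path + pre + "-" + text + ".PDF")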
The non-multithreaded version I wrote works great, but it's a bit slow, and I thought I'd see if I could speed it up. This version, however, finds the very first file, renames it correctly, and then throws an error saying:
FileNotFoundError: [Errno 2] No such file or directory: './pdfPages/1006941.pdf'
which is basically telling me that it can't find a file by that name. The reason it can't is that the script already renamed it. In my head, that tells me I've probably messed something up with this loop and/or with multithreading in general.
Any help would be appreciated.
Source:
import PyPDF2
import os
from os import listdir
from os.path import isfile, join
from PyPDF2 import PdfFileWriter, PdfFileReader
from multiprocessing.dummy import Pool as ThreadPool
# Global
i=0
def readPDF(allFiles):
    print(allFiles)
    global i
    while i < l:
        i+=1
        pdf_file = open(path+allFiles, 'rb')
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.getPage(0)
        page_content = page.extractText()
        pdf_file.close()
        Text = str(page_content.encode('utf-8')).strip("b").strip("'")
        os.rename(path+allFiles,path+pre+"-"+Text+".PDF")
pre = "77"
path = "./pdfPages/"
included_extensions = ['pdf','PDF']
allFiles = [f for f in listdir(path) if any(f.endswith(ext) for ext in included_extensions)] # Get all files in current directory
l = len(allFiles)
pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
pool.close()
pool.join()
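For what it's worth, my understanding is that pool.map calls the worker once per element of the list and returns the results in input order, along the lines of this toy example (the names here are made up, just to show how I think map behaves):

from multiprocessing.dummy import Pool as ThreadPool

def worker(item):
    # map hands each call exactly one element of the iterable
    print("worker got:", item)
    return item.upper()

pool = ThreadPool(4)
results = pool.map(worker, ["a.pdf", "b.pdf", "c.pdf"])
pool.close()
pool.join()
print(results)  # ['A.PDF', 'B.PDF', 'C.PDF'], in input order

Given that, I don't see why my readPDF ends up trying to reopen a file that has already been renamed.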