I have a PDF file of around 1000 pages and want to remove the pages that do not contain a specific word. For instance, the code would search each page for a word such as "STACKOVER"; if the word is not found on a page, it removes that page and moves on to the next one, and at the end it saves the file.
Welcome to SO. What have you tried so far? – Serge de Gosson de Varennes, commented Jan 21, 2022 at 5:59
Next time you post a question, you will need to give more information and show attempts. This will save you a lot of grief. Happy coding! – Serge de Gosson de Varennes, commented Jan 21, 2022 at 6:21
Thank you for your comment, I will be more careful when posting next time. – Alper D., commented Jan 21, 2022 at 8:44
1 Answer
The way to do this is as follows. First, define the search words you are looking for (in my case I tested it on a medical journal and searched for searchwords=['unclear risk for poorly']). Second, find all pages containing the word or string and store the page numbers in a list (pages_to_delete). For safekeeping, I also write them to a CSV file that records the page on which each search word is found. Third, open the original PDF, delete the pages contained in the list, and save the result to a new PDF.
import PyPDF2
import re
from PyPDF2 import PdfFileWriter, PdfFileReader

# Open the source PDF (legacy PyPDF2 API, as used throughout this answer).
pdfFileObj=open(r'C:\Users\s-degossondevarennes\......\dddtest.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

pages_text=[]        # extracted text of each page
words_start_pos={}   # page index -> start positions of the matches
words={}             # page index -> matched strings
searchwords=['unclear risk for poorly']
pages_to_delete = []

# Record every match in a CSV file and collect the pages to delete.
with open('Pages.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    pages_to_delete.append(page)

# Copy every page that is not marked for deletion into a new PDF.
# (The original passed 'rb' as a second argument, which PdfFileReader
# does not interpret as a file mode, so it is dropped here.)
infile = PdfFileReader(r'C:\Users\s-degossondevarennes\.......\dddtest.pdf')
output = PdfFileWriter()
for i in range(infile.getNumPages()):
    if i not in pages_to_delete:
        p = infile.getPage(i)
        output.addPage(p)
with open('Newdddtest.pdf', 'wb') as f:
    output.write(f)
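Note that the question asks to remove the pages that do not contain the word, while the code above removes the pages that do contain it. The following is a minimal sketch of one way to make that direction explicit, not a definitive implementation. The names `select_pages_to_keep`, `filter_pdf`, and `keep_if_found` are hypothetical and only for illustration; the PDF calls are the same legacy PyPDF2 ones used above.

```python
import re


def select_pages_to_keep(page_texts, word, keep_if_found=True):
    """Return the 0-based indexes of the pages to keep.

    keep_if_found=True keeps only pages that contain `word` (what the
    question asks for); False keeps only pages that lack it (what the
    answer's code does).
    """
    pattern = re.compile(re.escape(word))  # literal, case-sensitive match
    return [
        index
        for index, text in enumerate(page_texts)
        if bool(pattern.search(text)) == keep_if_found
    ]


def filter_pdf(src_path, dst_path, word, keep_if_found=True):
    """Write only the selected pages of src_path to dst_path."""
    # Imported here so select_pages_to_keep works without PyPDF2 installed.
    # These are the legacy PyPDF2 names used in the answer.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, 'rb') as src:
        reader = PdfFileReader(src)
        texts = [reader.getPage(i).extractText() for i in range(reader.numPages)]
        writer = PdfFileWriter()
        for index in select_pages_to_keep(texts, word, keep_if_found):
            writer.addPage(reader.getPage(index))
        with open(dst_path, 'wb') as dst:
            writer.write(dst)


if __name__ == '__main__':
    # Plain strings stand in for extracted page text.
    demo = ['intro STACKOVER here', 'nothing relevant', 'STACKOVER again']
    print(select_pages_to_keep(demo, 'STACKOVER'))         # [0, 2]
    print(select_pages_to_keep(demo, 'STACKOVER', False))  # [1]
```

Keeping the selection logic separate from the PDF reading and writing means it can be checked on plain strings before running it on a 1000-page file.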
Update
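The original line lowercases the page text but not the search word, so a search word that contains capital letters (such as 'Abstract') never matches. To match the text exactly as it is written in the PDF, replace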
words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
with
words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page])]
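If the search should ignore case entirely, one option is to pass `re.IGNORECASE` instead of lowercasing either string, and to escape the search word so that characters such as `.` or `(` are matched literally. This is a sketch under those assumptions; `find_matches` is a hypothetical helper name, not part of the answer's code.

```python
import re


def find_matches(word, text):
    """Return (start, matched_text) for every case-insensitive hit of `word`."""
    # re.escape makes the search word literal; IGNORECASE ignores letter case.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


if __name__ == '__main__':
    print(find_matches('Abstract', 'abstract ABSTRACT'))
    # [(0, 'abstract'), (9, 'ABSTRACT')]
```

Because the match objects carry the text as it appears on the page, the CSV can still record the original spelling of each hit.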
4 Comments
Alper D.
Thank you for your help. However, the code only works for the search word "unclear risk for poorly". I looked on the web and found your PDF file, titled "Assessing Risk of Bias as a Domain of Quality in Medical Test Studies". For instance, when I try to delete the first page of this file by changing your code to searchwords=['Abstract'], the code does not work. Maybe I am missing something?
Serge de Gosson de Varennes
Updated the answer! Happy coding!
Alper D.
Thank you for the update. The code is probably color-sensitive: it only removes pages where the word appears in black, not pages where it appears in another color.
Serge de Gosson de Varennes
I tested with Abstract in red. Worked just fine. It would be interesting to see your own file.