I have a PDF file with around 1000 pages and want to remove every page that does not contain a specific word. For instance, the code would search each page for a word such as "STACKOVER"; if the word is not found on a page, the page is removed and the search continues with the next page. At the end, the file is saved.

  • Welcome to SO. What have you tried so far? Commented Jan 21, 2022 at 5:59
  • Next time you post a question, you will need to give more information and show attempts. This will save you a lot of grief. Happy coding! Commented Jan 21, 2022 at 6:21
  • Thank you for your comment, I will be more careful when posting next time. Commented Jan 21, 2022 at 8:44

1 Answer

The way to do this is as follows. First, define the search words you are looking for (in my case I tested it on a medical journal and searched for searchwords=['unclear risk for poorly']). Second, find all pages containing the word or string and store their page numbers in a list (pages_to_delete). For safekeeping, I also write them to a CSV file that records the page on which each search word is found. Third, open the original PDF, delete the pages contained in the list, and save the result to a new PDF.

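import re
# Written against the legacy PyPDF2 API (PdfFileReader, getPage, extractText),
# which PyPDF2 3.0 removed
from PyPDF2 import PdfFileWriter, PdfFileReader

pdfFileObj=open(r'C:\Users\s-degossondevarennes\......\dddtest.pdf',mode='rb')
pdfReader=PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

# Extract the text of every page once, before searching
pages_text=[pdfReader.getPage(page).extractText() for page in range(number_of_pages)]
pdfFileObj.close()

words_start_pos={}
words={}

searchwords=['unclear risk for poorly']

pages_to_delete = []

with open('Pages.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            # Start offsets of every match on this page. Note that word is treated
            # as a regular expression; use re.escape(word) for a literal search.
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            # The matched text as it appears on the page (original case)
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for match in words[page]:
                # Log the 1-based page number next to the matched text
                f.write('{0},{1}\n'.format(page+1, match))
            # Mark each matching page for deletion only once
            if words[page] and page not in pages_to_delete:
                pages_to_delete.append(page)

# The second positional argument of PdfFileReader is `strict`, not a file mode,
# so only the path is passed here
infile = PdfFileReader(r'C:\Users\s-degossondevarennes\.......\dddtest.pdf')
output = PdfFileWriter()

# Copy every page that was not marked for deletion
for i in range(infile.getNumPages()):
    if i not in pages_to_delete:
        p = infile.getPage(i)
        output.addPage(p)

with open('Newdddtest.pdf', 'wb') as f:
    output.write(f)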
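
Note that the code above removes the pages on which a search word is found, while the question asks for the opposite: drop the pages where the word is not found. A minimal sketch of that variant follows. The names pages_to_keep and filter_pdf are hypothetical, and filter_pdf assumes the newer pypdf package (PdfReader, PdfWriter) is installed rather than the legacy PyPDF2 API used above. The page selection is kept in a plain function so it can be checked without any PDF at all.

```python
import re


def pages_to_keep(pages_text, searchwords):
    """Return 0-based indices of pages whose text contains any search word."""
    # re.escape makes the search literal; IGNORECASE ignores letter case
    patterns = [re.compile(re.escape(w), re.IGNORECASE) for w in searchwords]
    return [
        index
        for index, text in enumerate(pages_text)
        if any(p.search(text) for p in patterns)
    ]


def filter_pdf(src_path, dst_path, searchwords):
    """Write dst_path with only the pages of src_path that contain a search word."""
    # Assumption: the pypdf package is installed. It is imported here so that
    # pages_to_keep() stays usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    # extract_text() may yield nothing for scanned or image-only pages
    pages_text = [page.extract_text() or "" for page in reader.pages]

    writer = PdfWriter()
    for index in pages_to_keep(pages_text, searchwords):
        writer.add_page(reader.pages[index])

    with open(dst_path, "wb") as f:
        writer.write(f)


sample_pages = ["Intro to STACKOVER", "nothing relevant here", "see stackover again"]
print(pages_to_keep(sample_pages, ["STACKOVER"]))  # [0, 2]
```

Pages with no extractable text (for example scanned images) would never match and would therefore be dropped, so it may be worth checking the result before discarding the original file.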

Update

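The original line lowercases the page text before searching, so a search word that contains capital letters, such as 'Abstract', never matches. To search the text exactly as it appears on the page, replace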

words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]

with

words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page])]
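
Both versions of this line stay sensitive to letter case in one way or another: the first lowercases only the page text, and the second compares the text as-is. If the goal is to match regardless of case, one option is sketched below, assuming the search word should be taken literally rather than as a regular expression. The helper name find_matches is hypothetical.

```python
import re


def find_matches(word, page_text):
    """Return (start, matched_text) pairs for literal, case-insensitive hits."""
    # re.escape treats characters such as "." or "(" in the word literally
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [(m.start(), m.group()) for m in pattern.finditer(page_text)]


print(find_matches("Abstract", "ABSTRACT. This abstract is short."))
# [(0, 'ABSTRACT'), (15, 'abstract')]
```

The start offsets play the role of words_start_pos[page], and the matched strings the role of words[page].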

4 Comments

Thank you for your help. However, the code only works for the search word "unclear risk for poorly". I looked on the web and found your PDF file, titled "Assessing Risk of Bias as a Domain of Quality in Medical Test Studies". For instance, when I try to delete the first page of this file by changing your code to searchwords=['Abstract'], the code does not work. Maybe I am missing something?
Updated the answer! Happy coding!
Thank you for the update. The code seems to be color-sensitive: it does not remove pages whose matching words are in any color other than black.
I tested with Abstract in red. It worked just fine. It would be interesting to see your own file.
