Removing pages in a pdf file conditioning on something using Python

Question

I have a PDF file of around 1000 pages and want to remove the pages that do not contain a specific word. For instance, the code would search each page for a word such as "STACKOVER"; if the word is not found, the page is removed and the search continues with the next page. At the end, the file is saved.

Next time you post a question, you will need to give more information and show attempts. This will save you a lot of grief. Happy coding!
– Serge de Gosson de Varennes, Commented Jan 21, 2022 at 6:21
Thank you for your comment, I will be more careful when posting next time.
– Alper D., Commented Jan 21, 2022 at 8:44

Serge de Gosson de Varennes · Accepted Answer · 2022-01-21 09:01:58Z

The way to do this is as follows. First, define the search words you are looking for (I tested it on a medical journal with searchwords=['unclear risk for poorly']). Second, find all pages containing the word or string and store the page numbers in a list (pages_to_delete). For safekeeping, I also write them to a CSV file that records the page on which each search word is found. Third, open the original PDF, skip the pages contained in the list, and save the rest to a new PDF. Note that this removes the pages that contain the search word; to remove the pages that do not contain it, as the question asks, change the final condition to if i in pages_to_delete.

import PyPDF2
import re
from PyPDF2 import PdfFileWriter, PdfFileReader

pdfFileObj=open(r'C:\Users\s-degossondevarennes\......\dddtest.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

pages_text=[]
words_start_pos={}
words={}

searchwords=['unclear risk for poorly']

pages_to_delete = []

with open('Pages.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    # Extract the text of every page once, before searching.
    for page in range(number_of_pages):
        pages_text.append(pdfReader.getPage(page).extractText())
    for word in searchwords:
        for page in range(number_of_pages):
            # The page text is lowercased here, so `word` must be lowercase to match.
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for match in words[page]:
                # Record every hit in the CSV (1-based page number).
                f.write('{0},{1}\n'.format(page+1, match))
            if words[page] and page not in pages_to_delete:
                pages_to_delete.append(page)

# The second positional argument of PdfFileReader is `strict`, not a file mode, so 'rb' is dropped.
infile = PdfFileReader(r'C:\Users\s-degossondevarennes\.......\dddtest.pdf')
output = PdfFileWriter()

for i in range(infile.getNumPages()):
    if i not in pages_to_delete:
        p = infile.getPage(i)
        output.addPage(p)

with open('Newdddtest.pdf', 'wb') as f:
    output.write(f)
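
The PdfFileReader/PdfFileWriter names used above were deprecated in later PyPDF2 releases, and the project continues as pypdf. As a rough sketch only, assuming the pypdf package is installed, the same idea could look like the following. It inverts the logic to match the question, keeping only the pages that contain the word. The function names indices_to_keep and keep_pages_containing are hypothetical, and the search is a plain case-insensitive substring test rather than a regex.

```python
from typing import Iterable, List


def indices_to_keep(page_texts: Iterable[str], word: str) -> List[int]:
    """Return 0-based indices of pages whose text contains `word` (case-insensitive)."""
    needle = word.casefold()
    return [i for i, text in enumerate(page_texts)
            if needle in (text or "").casefold()]


def keep_pages_containing(src_path: str, dst_path: str, word: str) -> List[int]:
    """Copy only the pages of `src_path` that contain `word` into `dst_path`."""
    # Imported here so the pure helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    texts = [page.extract_text() for page in reader.pages]
    keep = indices_to_keep(texts, word)

    writer = PdfWriter()
    for i in keep:
        writer.add_page(reader.pages[i])
    with open(dst_path, "wb") as out:
        writer.write(out)
    return keep
```

A call such as keep_pages_containing("input.pdf", "output.pdf", "STACKOVER") would write the filtered file; both file names are placeholders. Whether a word is found still depends on how well the text can be extracted from the PDF in question.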

Update

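The original line lowercases the page text before searching, so a search word containing capital letters (for example 'Abstract') never matches. To search the text exactly as written, replace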

words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]

with

words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page])]
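
If a case-insensitive search is preferred over an exact one, a minimal sketch using only the standard re module is shown here. find_word is a hypothetical helper name and is not part of the code in this answer. It also escapes the search word, so characters such as '.' or '(' are matched literally instead of being read as regex syntax.

```python
import re
from typing import List


def find_word(text: str, word: str, ignore_case: bool = True) -> List[str]:
    """Return every occurrence of `word` in `text`, as it appears in the text."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the search word literal rather than a regex pattern.
    pattern = re.compile(re.escape(word), flags)
    return [m.group(0) for m in pattern.finditer(text)]


hits = find_word("Abstract. An abstract view.", "Abstract")
print(hits)  # ['Abstract', 'abstract']
```

A page would then count as a match whenever the returned list is non-empty.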

edited Jan 21, 2022 at 9:01

answered Jan 21, 2022 at 6:18

Serge de Gosson de Varennes

4 Comments

Alper D. Over a year ago

Thank you for your help. However, the code only works for the search word of "unclear risk for poorly". I looked on the web and found your pdf file titled "Assessing Risk of Bias as a Domain of Quality in Medical Test Studies". For instance, when I want to delete the first page of this file by changing your code as searchwords=['Abstract'], the code does not work. Maybe, I am missing something?

Serge de Gosson de Varennes Over a year ago

Updated the answer! Happy coding!

Alper D. Over a year ago

Thank you for the update. The code is probably color-sensitive. It does not remove pages where the search word appears in any color other than black.

Serge de Gosson de Varennes Over a year ago

I tested with Abstract in red. Worked just fine. It would be interesting to see your own file.
