How do I change the hyperlinks in a PDF using Python? I am currently using PyPDF2 to open the file and loop through its pages. How do I scan for hyperlinks and then change them?

2 Answers

8

I couldn't get this working with the PyPDF2 library.

I did however get something working with another library: pdfrw. This installed fine for me using pip in Python 3.6:

pip install pdfrw

Note: for the following I have been using this example pdf I found online which contains multiple links. Your mileage may vary with this.

import pdfrw

pdf = pdfrw.PdfReader("pdf.pdf")  # Load the pdf
new_pdf = pdfrw.PdfWriter()  # Create an empty pdf

for page in pdf.pages:  # Go through the pages

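    # Links are in Annots, but some pages don't have links so Annots returns None
    for annot in page.Annots or []:

        # Skip annotations without a URI action (e.g. comments or internal links);
        # pdfrw returns None for missing keys, so annot.A.URI would fail on those
        if annot.A is None or annot.A.URI is None:
            continue

        old_url = annot.A.URI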

        # >Here you put logic for replacing the URLs<
        
        # Use the PdfString object to do the encoding for us
        # Note the brackets around the URL here
        new_url = pdfrw.objects.pdfstring.PdfString("(http://www.google.com)")

        # Override the URL with ours
        annot.A.URI = new_url

    new_pdf.addpage(page)    

new_pdf.write("new.pdf")
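
The placeholder comment in the loop above leaves the replacement logic open. As a minimal sketch of what could go there, the helper below swaps the host of a URL according to a lookup table. The names `HOST_MAP`, `strip_pdf_parens`, `rewrite_url` and `as_pdf_literal` are hypothetical and not part of pdfrw, and the code assumes the URI is stored as a literal string in parentheses, as in the example above. URLs containing unbalanced parentheses or backslashes would need escaping, which this sketch does not handle.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping of old hosts to new hosts; replace with your own rules.
HOST_MAP = {"old.example.com": "new.example.com"}


def strip_pdf_parens(raw):
    """Drop the surrounding parentheses of a PDF literal string, if present."""
    text = str(raw)
    if text.startswith("(") and text.endswith(")"):
        return text[1:-1]
    return text


def rewrite_url(raw, host_map=HOST_MAP):
    """Return the URL with its host swapped, or None if no rule applies."""
    parts = urlsplit(strip_pdf_parens(raw))
    new_host = host_map.get(parts.netloc)
    if new_host is None:
        return None
    return urlunsplit(parts._replace(netloc=new_host))


def as_pdf_literal(url):
    """Wrap a URL in parentheses, the form passed to PdfString in the loop."""
    return "(" + url + ")"
```

Inside the loop, this could be used as `new = rewrite_url(old_url)`, followed by `annot.A.URI = pdfrw.objects.pdfstring.PdfString(as_pdf_literal(new))` only when `new` is not `None`, so that links without a matching rule are left untouched.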

1 Comment

For some reason, I am able to detect the URLs but unable to override them. I used the exact same code, but something is not right. Could you help me out?
2

I managed to get it working with PyPDF2.

If you just want to remove all annotations for a page, you just have to do:

if '/Annots' in page: del page['/Annots']

Else, here is how you change each link:

import PyPDF2

new_link = "https://www.youtube.com/watch?v=dQw4w9WgXcQ" # great video by the way

# Note: these camelCase names are the PyPDF2 1.x API; they were deprecated
# and then removed in PyPDF2 3.0
pdf_reader = PyPDF2.PdfFileReader("input.pdf")
pdf_writer = PyPDF2.PdfFileWriter()

for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)

    # Only pages with annotations need editing; every page is still copied below
    if '/Annots' in page:
        for annot in page['/Annots']:
            annot_obj = annot.getObject()
            if '/A' not in annot_obj or '/URI' not in annot_obj['/A']:
                continue  # not a URI link
            # dictionary keys are PDF names, the URL itself is a PDF string
            key   = PyPDF2.generic.NameObject("/URI")
            value = PyPDF2.generic.TextStringObject(new_link)
            annot_obj['/A'][key] = value

    pdf_writer.addPage(page)

with open('output.pdf', 'wb') as f:
    pdf_writer.write(f)

An equivalent one-liner for a given page index i and annotation index j would be:

pdf_reader.getPage(i)['/Annots'][j].getObject()['/A'][PyPDF2.generic.NameObject("/URI")] = PyPDF2.generic.TextStringObject(new_link)
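
For readers on a newer PyPDF2, the same approach might look roughly like the sketch below, which uses the snake_case names from PyPDF2 3.x (`PdfReader`, `reader.pages`, `get_object`, `add_page`). The helper names `replace_uri` and `replace_all_links` are hypothetical, and the code has not been checked against every PyPDF2 release, so treat it as a starting point only. The PyPDF2 import is kept inside the function so the dictionary logic can be tried without the library installed.

```python
def replace_uri(annotation, new_link, name_type=str, string_type=str):
    """Point one link annotation at new_link; return True if it was changed.

    Works on any dict-like annotation. name_type and string_type build the
    key and value objects (plain str by default, PyPDF2 types in real use).
    """
    if "/A" not in annotation or "/URI" not in annotation["/A"]:
        return False  # not a URI link
    annotation["/A"][name_type("/URI")] = string_type(new_link)
    return True


def replace_all_links(src, dst, new_link):
    """Copy src to dst with every URI link replaced (assumes PyPDF2 >= 3.0)."""
    from PyPDF2 import PdfReader, PdfWriter
    from PyPDF2.generic import NameObject, TextStringObject

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        if "/Annots" in page:
            for annot in page["/Annots"]:
                replace_uri(annot.get_object(), new_link,
                            NameObject, TextStringObject)
        writer.add_page(page)  # pages without links are copied as well
    with open(dst, "wb") as f:
        writer.write(f)
```

A call such as `replace_all_links("input.pdf", "output.pdf", new_link)` would then stand in for the loop above.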

1 Comment

How can I make the URL open in a new tab? Currently it opens in the same window when the PDF is viewed in a browser.
