How to convert webpage into PDF by using Python

Question

I was looking for a way to print a webpage to a local PDF file using Python. One good solution is to use Qt, found here: https://bharatikunal.wordpress.com/2010/01/.

It didn't work at first because I had a problem installing PyQt4; it gave error messages such as 'ImportError: No module named PyQt4.QtCore'.

That was because PyQt4 was not installed properly. My libraries are usually located at C:\Python27\Lib, but PyQt4 was not there.

In fact, you simply need to download it from http://www.riverbankcomputing.com/software/pyqt/download (mind the Python version you are using) and install it to C:\Python27 (in my case). That's it.

Now the script runs fine, so I want to share it. For more options in using QPrinter, please refer to http://qt-project.org/doc/qt-4.8/qprinter.html#Orientation-enum.

Note that you can post a Q&A simultaneously if you're self-answering, and the usual quality rules still apply to both parts.
– jonrsharpe, Commented Jan 29, 2018 at 22:24

Eddie Reasoner · Accepted Answer · 2025-11-26 17:44:23Z

196

You can also use pdfkit:

Usage

import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')

Install

MacOS: brew install Caskroom/cask/wkhtmltopdf

Debian/Ubuntu: apt-get install wkhtmltopdf

Windows: choco install wkhtmltopdf

See official documentation for MacOS/Ubuntu/other OS: https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf

edited 2 days ago

Eddie Reasoner

answered May 20, 2014 at 13:24

NorthCat

18 Comments

Dowlers Over a year ago

This is awesome, way easier than messing around with reportlab or using a print drive to convert. Thanks so much.

Tim Ludwinski Over a year ago

PDFKit requires a running X Server (or "virtual" X Server). :( See here: github.com/JazzCore/python-pdfkit/wiki/…

Tinmarino Over a year ago

Perfect !! It even downloads the embedded images, don't bother use that ! You'll have to apt-get install wkhtmltopdf

Rasmus Kaj Over a year ago

pdfkit depends on non-python package wkhtmltopdf, which in turn requires a running X server. So while nice in some environments, this is not an answer that works generally in python.

Salem Over a year ago

This package seems to not be maintained anymore... github.com/JazzCore/python-pdfkit/issues/242

Sunit Gautam · Accepted Answer · 2020-08-12 08:48:25Z

75

WeasyPrint

pip install weasyprint  # No longer supports Python 2.x.

python
>>> import weasyprint
>>> pdf = weasyprint.HTML('http://www.google.com').write_pdf()
>>> len(pdf)
92059
>>> open('google.pdf', 'wb').write(pdf)

edited Aug 12, 2020 at 8:48

Sunit Gautam

answered Dec 23, 2015 at 15:04

JohnMudd

10 Comments

Piyush S. Wanare Over a year ago

Can I provide file path instead of url?

stvsmth Over a year ago

I think I will prefer this project as its dependencies are Python packages rather than a system package. As of Jan 2018 it seems to have more frequent updates and better documentation.

visoft Over a year ago

There are too many things to install. I stopped at libpango and went for pdfkit. The system-wide wkhtmltopdf is nasty, but weasyprint also requires some system-wide installs.

suhailvs Over a year ago

This won't run the JavaScript in the HTML file. For that you need to use pdfkit.

Anatoly Scherbakov Over a year ago

I would believe the option should be 'wb', not 'w', because pdf is a bytes object.
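
On the question above about passing a file path instead of a URL: WeasyPrint's HTML class accepts a filename= keyword as well as url=. The following is only a minimal sketch, assuming WeasyPrint is installed; html_source_kwargs and convert are hypothetical helper names, not part of WeasyPrint.

```python
from urllib.parse import urlparse


def html_source_kwargs(source):
    """Return the keyword argument weasyprint.HTML() should receive.

    Anything with an http, https or file scheme is passed as url=...;
    everything else is assumed to be a path on disk (filename=...).
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https", "file"):
        return {"url": source}
    return {"filename": source}


def convert(source, output_path):
    """Render a URL or a local HTML file to output_path as a PDF."""
    import weasyprint  # imported lazily so the helper above works without it

    pdf_bytes = weasyprint.HTML(**html_source_kwargs(source)).write_pdf()
    # write_pdf() returns bytes, so the file must be opened in binary mode
    with open(output_path, "wb") as f:
        f.write(pdf_bytes)
```

With this, convert('notes.html', 'notes.pdf') and convert('http://www.google.com', 'google.pdf') would go through the same code path.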

Community · Accepted Answer · 2017-05-23 12:18:23Z

27

Thanks to the posts below, I was able to add the webpage's URL and the current time to every page of the generated PDF, no matter how many pages it has.

Add text to Existing PDF using Python

https://github.com/disflux/django-mtr/blob/master/pdfgen/doc_overlay.py

Here is the script:

import time
from pyPdf import PdfFileWriter, PdfFileReader
import StringIO
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from xhtml2pdf import pisa
import sys 
from PyQt4.QtCore import *
from PyQt4.QtGui import * 
from PyQt4.QtWebKit import * 

url = 'http://www.yahoo.com'
tem_pdf = "c:\\tem_pdf.pdf"
final_file = "c:\\younameit.pdf"

app = QApplication(sys.argv)
web = QWebView()
#Read the URL given
web.load(QUrl(url))
printer = QPrinter()
#setting format
printer.setPageSize(QPrinter.A4)
printer.setOrientation(QPrinter.Landscape)
printer.setOutputFormat(QPrinter.PdfFormat)
#export file as c:\tem_pdf.pdf
printer.setOutputFileName(tem_pdf)

def convertIt():
    web.print_(printer)
    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)

app.exec_()
# Do not call sys.exit() here: the script has to continue to the overlay step below

# Below is to add on the weblink as text and present date&time on PDF generated

outputPDF = PdfFileWriter()
packet = StringIO.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.setFont("Helvetica", 9)
# Write the footer line: URL on the left, timestamp on the right
oknow = time.strftime("%a, %d %b %Y %H:%M")
can.drawString(5, 2, url)
can.drawString(605, 2, oknow)
can.save()

#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file(tem_pdf, "rb"))
pages = existing_pdf.getNumPages()
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
for x in range(0,pages):
    page = existing_pdf.getPage(x)
    page.mergePage(new_pdf.getPage(0))
    output.addPage(page)
# finally, write "output" to a real file
outputStream = file(final_file, "wb")
output.write(outputStream)
outputStream.close()

print final_file, 'is ready.'
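
The script above targets Python 2 (pyPdf, StringIO, file()). For Python 3, the overlay step could be sketched roughly as follows. This is an assumption-laden port, not the original author's code: it assumes the pypdf and reportlab packages are installed, uses a landscape A4 canvas to match the printer settings above instead of letter, and footer_fields and stamp_pdf are hypothetical names.

```python
import io
import time


def footer_fields(url, when=None):
    """Return the two footer strings: the page URL and a formatted timestamp."""
    stamp = time.strftime("%a, %d %b %Y %H:%M", when or time.localtime())
    return url, stamp


def stamp_pdf(source_pdf, final_pdf, url):
    """Overlay the URL and a timestamp on every page of source_pdf."""
    # Third-party imports are kept local: pip install pypdf reportlab
    from pypdf import PdfReader, PdfWriter
    from reportlab.lib.pagesizes import A4, landscape
    from reportlab.pdfgen import canvas

    left_text, right_text = footer_fields(url)

    # Draw the footer on an in-memory, single-page PDF
    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=landscape(A4))
    can.setFont("Helvetica", 9)
    can.drawString(5, 2, left_text)
    can.drawString(605, 2, right_text)
    can.save()
    packet.seek(0)
    overlay = PdfReader(packet).pages[0]

    # Merge the footer page onto each page of the existing PDF
    writer = PdfWriter()
    for page in PdfReader(source_pdf).pages:
        page.merge_page(overlay)
        writer.add_page(page)
    with open(final_pdf, "wb") as f:
        writer.write(f)
```

The Qt printing step is left out here; stamp_pdf only needs some existing PDF as its input.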

edited May 23, 2017 at 12:18

CommunityBot

answered Apr 30, 2014 at 7:31

Mark K

8 Comments

sam2426679 Over a year ago

Thanks for sharing your code! Any advice for making this work for local pdf files? Or is it as easy as prepending "file:///" to the url? I'm not very familiar with these libraries... thanks

Mark K Over a year ago

@user2426679, you mean convert online PDF into local PDF files?

sam2426679 Over a year ago

thanks for your reply... sorry for my tardiness. I ended up using wkhtmltopdf since it was able to handle what I was throwing at it. But I was asking how to load a pdf that was local to my hdd. Cheers

Mark K Over a year ago

@user2426679 sorry I still don't get you. maybe because I am a newbie to Python too. You meant read local PDF files in Python?

Blairg23 Over a year ago

There were some issues with html5lib, which is used by xhtml2pdf. This solution fixed the problem: github.com/xhtml2pdf/xhtml2pdf/issues/318

Jean-François Fabre · Accepted Answer · 2021-03-14 20:38:25Z

16

Per this answer: How to convert webpage into PDF by using Python, the advice was to use pdfkit. You also have to install wkhtmltopdf.

If you have a local .html file, you then need to use this command:

pdfkit.from_file('test.html', 'out.pdf')

But this will throw an error if you haven't added the wkhtmltopdf executables to your system path. This was the part that tripped me up and I wanted to share.

On Windows, open your environment variables and add them to your System variables > Path like below. In my case, these .exe files were located here after I installed the wkhtmltopdf from an exe:

C:\Program Files\wkhtmltopdf\bin
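
To check whether the PATH change took effect (open a new terminal first, since running programs keep the old environment), a small diagnostic along these lines may help. It uses only the standard library; find_on_path and path_entries_mentioning are hypothetical helper names.

```python
import os
import shutil


def find_on_path(executable="wkhtmltopdf"):
    """Return the full path of the executable if PATH exposes it, else None."""
    return shutil.which(executable)


def path_entries_mentioning(fragment="wkhtmltopdf"):
    """List PATH directories whose name contains the fragment (case-insensitive)."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    return [entry for entry in entries if fragment.lower() in entry.lower()]


if __name__ == "__main__":
    location = find_on_path()
    if location:
        print("wkhtmltopdf found at", location)
    else:
        print("wkhtmltopdf is not on PATH")
        print("PATH entries mentioning it:", path_entries_mentioning())
```

If the second list is non-empty but the executable is still not found, the PATH entry probably points at the wrong folder (for example the install root rather than its bin directory).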

edited Mar 14, 2021 at 20:38

Jean-François Fabre♦

answered Jan 29, 2018 at 22:31

Jarad

1 Comment

kudo_shinichi Over a year ago

I was facing the same issue on Win10, this helped, thanks a ton.

FractalSpace · Accepted Answer · 2019-10-26 17:32:19Z

15

Here is a version that works fine:

import sys 
from PyQt4.QtCore import *
from PyQt4.QtGui import * 
from PyQt4.QtWebKit import * 

app = QApplication(sys.argv)
web = QWebView()
web.load(QUrl("http://www.yahoo.com"))
printer = QPrinter()
printer.setPageSize(QPrinter.A4)
printer.setOutputFormat(QPrinter.PdfFormat)
printer.setOutputFileName("fileOK.pdf")

def convertIt():
    web.print_(printer)
    print("Pdf generated")
    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)
sys.exit(app.exec_())

edited Oct 26, 2019 at 17:32

FractalSpace

answered Apr 29, 2014 at 8:11

Mark K

2 Comments

amergin Over a year ago

Interestingly, the web page links are generated as text rather than links in the generated PDF.

boson Over a year ago

Anyone know why this would be generating blank pdfs for me?

Jim Paul · Accepted Answer · 2015-03-12 13:31:32Z

11

Here is a simple solution using Qt. I found this as part of an answer to a different question on Stack Overflow. I tested it on Windows.

from PyQt4.QtGui import QTextDocument, QPrinter, QApplication

import sys
app = QApplication(sys.argv)

doc = QTextDocument()
location = "c://apython//Jim//html//notes.html"
html = open(location).read()
doc.setHtml(html)

printer = QPrinter()
printer.setOutputFileName("foo.pdf")
printer.setOutputFormat(QPrinter.PdfFormat)
printer.setPageSize(QPrinter.A4)
printer.setPageMargins(15, 15, 15, 15, QPrinter.Millimeter)

doc.print_(printer)
print("done!")

edited Mar 12, 2015 at 13:31

answered Jan 20, 2015 at 20:38

Jim Paul

Mark K · Accepted Answer · 2019-10-18 02:09:54Z

9

I tried @NorthCat's answer using pdfkit.

It requires wkhtmltopdf to be installed. The installer can be downloaded from https://wkhtmltopdf.org/downloads.html

Run the installer, then add a line telling pdfkit where wkhtmltopdf is, as below (referenced from "Can't create pdf using python PDFKIT Error : No wkhtmltopdf executable found").

import pdfkit

# Full path to the wkhtmltopdf executable (adjust to your install location)
path_wkhtmltopdf = "C:\\Folder\\where\\wkhtmltopdf.exe"
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)

pdfkit.from_url("http://google.com", "out.pdf", configuration=config)
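
If the install location differs between machines, one option is to try a few candidate paths before building the configuration. This is only a sketch: first_existing and build_config are hypothetical helpers, the candidate list is whatever you choose to pass in, and pdfkit must be installed for build_config to run.

```python
import os
import shutil


def first_existing(candidates):
    """Return the first candidate path that is an existing file, else None."""
    for candidate in candidates:
        if candidate and os.path.isfile(candidate):
            return candidate
    return None


def build_config(candidates):
    """Build a pdfkit configuration from the first usable wkhtmltopdf path."""
    import pdfkit  # imported lazily; requires `pip install pdfkit`

    # Fall back to whatever PATH provides if none of the candidates exists
    executable = first_existing(candidates) or shutil.which("wkhtmltopdf")
    if executable is None:
        raise FileNotFoundError("wkhtmltopdf executable not found")
    return pdfkit.configuration(wkhtmltopdf=executable)
```

The result can be passed as configuration=build_config([...]) to pdfkit.from_url in the same way as config above.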

answered Oct 18, 2019 at 2:09

Mark K

1 Comment

mLstudent33 Over a year ago

where did it go after I clicked .deb and installed on software centre?

Y.kh · Accepted Answer · 2020-08-06 19:39:19Z

7

This solution worked for me using PyQt5 version 5.15.0

import sys
from PyQt5 import QtWidgets, QtWebEngineWidgets
from PyQt5.QtCore import QUrl
from PyQt5.QtGui import QPageLayout, QPageSize
from PyQt5.QtWidgets import QApplication

if __name__ == '__main__':
    app = QtWidgets.QApplication(sys.argv)
    loader = QtWebEngineWidgets.QWebEngineView()
    loader.setZoomFactor(1)
    layout = QPageLayout()
    layout.setPageSize(QPageSize(QPageSize.A4Extra))
    layout.setOrientation(QPageLayout.Portrait)
    loader.load(QUrl('https://stackoverflow.com/questions/23359083/how-to-convert-webpage-into-pdf-by-using-python'))
    loader.page().pdfPrintingFinished.connect(lambda *args: QApplication.exit())

    def emit_pdf(finished):
        loader.page().printToPdf("test.pdf", pageLayout=layout)

    loader.loadFinished.connect(emit_pdf)
    sys.exit(app.exec_())
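
As the comments below point out, QtWebEngineWidgets comes from the separate PyQtWebEngine package. A quick standard-library check like the following can tell whether the required modules are visible to the interpreter before running the script. It is a sketch with a hypothetical helper name (missing_modules), and it only checks that the modules can be located, not that their DLLs load correctly.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be located by this interpreter."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # Raised when a parent package (e.g. PyQt5 itself) is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    required = ["PyQt5.QtWidgets", "PyQt5.QtWebEngineWidgets"]
    absent = missing_modules(required)
    if absent:
        print("Missing:", ", ".join(absent))
        print("Try: pip install PyQt5 PyQtWebEngine")
    else:
        print("PyQt5 and QtWebEngine can be located")
```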

answered Aug 6, 2020 at 19:39

Y.kh

4 Comments

Dan Over a year ago

I tried this and get this error: Traceback (most recent call last): File "C:/Users/brentond/Documents/Python/PdfWebsite.py", line 2, in <module> from PyQt5 import QtWidgets, QtWebEngineWidgets ImportError: DLL load failed: The specified module could not be found.

Y.kh Over a year ago

You have to install the PyQt5 package first: pip install PyQt5

Dan Over a year ago

I do have it installed... But as far as I can see there is no PyQt5 method called QtwebEngineWidgets... At least not in 5.15.2 that I have installed in PyCharm.

Dániel Kis-Nagy Over a year ago

You also need to pip install PyQtWebEngine for this to work

David Golembiowski · Accepted Answer · 2023-08-16 09:37:25Z

6

If you use Selenium and Chromium, you do not need to manage cookies yourself, and you can generate the PDF with Chromium's "print as PDF" feature. You can refer to this project to see how it is done: https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter

The code below is modified from https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter/blob/master/sample/html_to_pdf_converter.py

import sys
import json, base64


def send_devtools(driver, cmd, params={}):
    resource = "/session/%s/chromium/send_command_and_get_result" % driver.session_id
    url = driver.command_executor._url + resource
    body = json.dumps({'cmd': cmd, 'params': params})
    response = driver.command_executor._request('POST', url, body)
    return response.get('value')


def get_pdf_from_html(driver, url, print_options={}, output_file_path="example.pdf"):
    driver.get(url)

    calculated_print_options = {
        'landscape': False,
        'displayHeaderFooter': False,
        'printBackground': True,
        'preferCSSPageSize': True,
    }
    calculated_print_options.update(print_options)
    result = send_devtools(driver, "Page.printToPDF", calculated_print_options)
    data = base64.b64decode(result['data'])
    with open(output_file_path, "wb") as f:
        f.write(data)



# example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import shutil

# Check for the existence of the chromedriver executable
chromedriver = shutil.which("chromedriver")
assert chromedriver is not None, "chromedriver not on PATH"

url = "https://stackoverflow.com/questions/23359083/how-to-convert-webpage-into-pdf-by-using-python#"
webdriver_options = Options()
webdriver_options.add_argument("--no-sandbox")
webdriver_options.add_argument('--headless')
webdriver_options.add_argument('--disable-gpu')

driver = webdriver.Chrome(chromedriver, options=webdriver_options)
get_pdf_from_html(driver, url)
driver.quit()
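
The code above relies on private attributes of the command executor and on passing the chromedriver path positionally, which newer Selenium 4 releases no longer accept. A variant that may work there uses the driver's execute_cdp_cmd method instead. This is a sketch under those assumptions, not the original project's code; merged_print_options, save_pdf_result and url_to_pdf are hypothetical names, and Chrome plus a matching driver still have to be available.

```python
import base64

DEFAULT_PRINT_OPTIONS = {
    "landscape": False,
    "displayHeaderFooter": False,
    "printBackground": True,
    "preferCSSPageSize": True,
}


def merged_print_options(overrides=None):
    """Combine the defaults with caller overrides without mutating either."""
    options = dict(DEFAULT_PRINT_OPTIONS)
    options.update(overrides or {})
    return options


def save_pdf_result(result, output_file_path):
    """Decode the base64 'data' field of a Page.printToPDF result and save it."""
    data = base64.b64decode(result["data"])
    with open(output_file_path, "wb") as f:
        f.write(data)
    return len(data)


def url_to_pdf(url, output_file_path, overrides=None):
    """Print a URL to PDF with headless Chrome (needs selenium and Chrome)."""
    from selenium import webdriver  # lazy import: only needed for real runs
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        result = driver.execute_cdp_cmd(
            "Page.printToPDF", merged_print_options(overrides)
        )
        return save_pdf_result(result, output_file_path)
    finally:
        # Always close the browser, even if loading or printing fails
        driver.quit()
```

Keeping the option merging and the base64 decoding in separate functions also avoids the mutable default arguments (params={}, print_options={}) used above.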

edited Aug 16, 2023 at 9:37

David Golembiowski

answered Jul 26, 2020 at 13:31

Yuanmeng Xiao

7 Comments

Yuanmeng Xiao Over a year ago

First I used weasyprint, but it does not support cookies. You can write your own default_url_fetcher to handle cookies, but I later ran into issues installing it on Ubuntu 16. Then I used wkhtmltopdf, which supports cookie settings, but it caused many OSErrors such as -15 and -11 when handling some pages.

Mark K Over a year ago

Thank you for sharing Mr. @Yuanmeng Xiao.

Dan Over a year ago

Hi @YuanmengXiao I copied your code above and I get this error: Traceback (most recent call last): File "C:/Users/brentond/Documents/Python/PdfWebsite.py", line 39, in <module> driver = webdriver.Chrome(chromedriver, options=webdriver_options) NameError: name 'chromedriver' is not defined

Dan Over a year ago

I then installed a module called chromedriver and imported it to the above code and now get this error Traceback (most recent call last): File "C:/Users/brentond/Documents/Python/PdfWebsite.py", line 33, in <module> import chromedriver File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\chromedriver_init_.py", line 16, in <module> raise RuntimeError('This package supports only Linux, MacOSX or Windows platforms') RuntimeError: This package supports only Linux, MacOSX or Windows platforms

Yuanmeng Xiao Over a year ago

You should download chromedriver from chromedriver.chromium.org. You would also do well to learn how to use Selenium to drive the Chrome browser.

blue-zircon · Accepted Answer · 2022-09-27 05:21:28Z

0

As explained in another answer, if you have .html files locally you can use the following:

pdfkit.from_file('abc.html', 'abc.pdf')

Additionally, if your source HTML file has img tags, each src should be a relative path, and you have to include this option to allow local file access:

pdfkit.from_file('abc.html', 'abc.pdf',options={"enable-local-file-access": ""})

Otherwise you may run into the following error

OSError: wkhtmltopdf reported an error: Exit with code 1 due to network error: ProtocolUnknownError

Source: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2660#issuecomment-663063752

pdfkit error: Exit with code 1 due to network error: ProtocolUnknownError
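
To spot img tags whose src is not a plain relative path before converting, a small standard-library check along these lines could be used. It is only a sketch, and ImageSourceCollector and non_relative_image_sources are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def non_relative_image_sources(html_text):
    """Return img src values that are not plain relative paths."""
    collector = ImageSourceCollector()
    collector.feed(html_text)
    flagged = []
    for src in collector.sources:
        parsed = urlparse(src)
        # A scheme (http:, file:, C:) or a leading slash means "not relative"
        if parsed.scheme or src.startswith(("/", "\\")):
            flagged.append(src)
    return flagged
```

For example, non_relative_image_sources(open('abc.html').read()) lists the entries worth reviewing in abc.html.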

answered Sep 27, 2022 at 5:21

blue-zircon
