Export Pandas DataFrame into a PDF file using Python

Question

What is an efficient way to generate a PDF from a pandas DataFrame?

user3226167 · Accepted Answer · 2020-01-03 07:12:21Z

39

First plot the table with matplotlib, then save the figure as a PDF:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

df = pd.DataFrame(np.random.random((10,3)), columns = ("col 1", "col 2", "col 3"))

#https://stackoverflow.com/questions/32137396/how-do-i-plot-only-a-table-in-matplotlib
fig, ax =plt.subplots(figsize=(12,4))
ax.axis('tight')
ax.axis('off')
the_table = ax.table(cellText=df.values,colLabels=df.columns,loc='center')

#https://stackoverflow.com/questions/4042192/reduce-left-and-right-margins-in-matplotlib-plot
pp = PdfPages("foo.pdf")
pp.savefig(fig, bbox_inches='tight')
pp.close()

reference:

How do I plot only a table in Matplotlib?

Reduce left and right margins in matplotlib plot

edited Jan 3, 2020 at 7:12

answered Jan 3, 2020 at 7:03

user3226167



4 Comments

Merlin Over a year ago

These tables via matplotlib don't look so great compared to LaTeX, or troff for that matter.

Gathide Over a year ago

@Merlin, Can df.to_latex output pdf? What is the process/requirements?

Lak Over a year ago

To improve the look of this (e.g. with alternating colors for the rows), see the answer below stackoverflow.com/a/72957628/3645038

user76170 Over a year ago

The column headers don't come out bold; all fonts look the same. I have multiple DataFrames to write into a single Excel file: e.g. one DataFrame contains just header info (vendor name, address), another contains the actual data, and a third is a footer. I write them to one Excel file using the startrow and startcol parameters of df.to_excel, so the Excel file has a structure. Is it possible in Python to export that Excel file to PDF?
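On the bold-header point raised in the comment above, Matplotlib lets you restyle individual table cells after the table is drawn. The following is a minimal sketch, not part of the original answer: `table_figure` and `frames_to_pdf` are hypothetical helper names, the sample data is made up, and it writes each DataFrame to its own page of one PDF rather than converting an Excel file.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages


def table_figure(df, figsize=(12, 4)):
    """Draw df as a table-only figure with a bold header row (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.axis("off")
    table = ax.table(cellText=df.values, colLabels=df.columns, loc="center")
    for (row, _col), cell in table.get_celld().items():
        if row == 0:  # row 0 holds the column labels
            cell.set_text_props(weight="bold")
    return fig, table


def frames_to_pdf(frames, path):
    """Write each DataFrame in frames to its own page of a single PDF."""
    with PdfPages(path) as pdf:
        for df in frames:
            fig, _ = table_figure(df)
            pdf.savefig(fig, bbox_inches="tight")
            plt.close(fig)


# Made-up sample data: a small "header" frame and a data frame.
header_df = pd.DataFrame({"vendor": ["ACME"], "city": ["Springfield"]})
data_df = pd.DataFrame(np.random.random((5, 3)), columns=["col 1", "col 2", "col 3"])
out_path = os.path.join(tempfile.mkdtemp(), "frames.pdf")
frames_to_pdf([header_df, data_df], out_path)
```

Placing several tables on the same page would need manual positioning (for example one axes per table), which this sketch does not attempt.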

Dalibor · Accepted Answer · 2017-08-16 08:47:35Z

18

Here is how I do it from a SQLite database, using sqlite3, pandas and pdfkit:

import pandas as pd
import pdfkit as pdf
import sqlite3

con=sqlite3.connect("baza.db")

df=pd.read_sql_query("select * from dobit", con)
df.to_html('/home/linux/izvestaj.html')
nazivFajla='/home/linux/pdfPrintOut.pdf'
pdf.from_file('/home/linux/izvestaj.html', nazivFajla)
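If the intermediate HTML file is not needed, the same flow can probably be kept in memory. The sketch below assumes an in-memory database with a made-up `dobit` table; `query_to_html` and `html_to_pdf` are hypothetical helper names. `pdfkit.from_string` still needs the wkhtmltopdf binary installed, so that call is left commented out.

```python
import sqlite3

import pandas as pd


def query_to_html(con, sql):
    """Run sql on an open connection and return the result as an HTML table string."""
    df = pd.read_sql_query(sql, con)
    return df.to_html(index=False)


def html_to_pdf(html, out_path):
    """Render an HTML string to a PDF; needs pdfkit and the wkhtmltopdf binary."""
    import pdfkit  # imported lazily so the HTML step works without it
    pdfkit.from_string(html, out_path)


# Demo with an in-memory database (table name reused from above, values made up).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dobit (id INTEGER, amount REAL)")
con.executemany("INSERT INTO dobit VALUES (?, ?)", [(1, 10.5), (2, 20.0)])
html = query_to_html(con, "select * from dobit")
con.close()

# html_to_pdf(html, "report.pdf")  # uncomment once wkhtmltopdf is installed
```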

answered Aug 16, 2017 at 8:47

Dalibor


2 Comments

Nguai al Over a year ago

pdfkit is not available for windows64

ChrisDanger Over a year ago

Worked great! Pdfkit install on a mac: pip install pdfkit && brew install Caskroom/cask/wkhtmltopdf

wgwz · Accepted Answer · 2015-10-15 19:24:38Z

10

Well, one way is to use Markdown. You can use df.to_html(), which converts the DataFrame into an HTML table. You can then put the generated HTML into a Markdown file (.md) (see http://daringfireball.net/projects/markdown/basics). From there, utilities exist to convert Markdown into a PDF (https://www.npmjs.com/package/markdown-pdf).

One all-in-one tool for this method is to use Atom text editor (https://atom.io/). There you can use an extension, search "markdown to pdf", which will make the conversion for you.

Note: When using to_html() recently I had to remove extra '\n' characters for some reason. I chose to use Atom -> Find -> '\n' -> Replace "".

Overall this should do the trick!

answered Oct 15, 2015 at 19:24

wgwz


2 Comments

Merlin Over a year ago

I think a solution with intermediate steps into HTML and then markdown (which doesn't even have a standard spec), then to pdf, is not a good way.

Duncan MacIntyre Over a year ago

You can now use .to_markdown() to avoid HTML entirely.

R_100 · Accepted Answer · 2021-02-04 13:57:24Z

8

With reference to these two examples that I found useful:

The simple CSS, saved as df_style.css in the same folder as the notebook (.ipynb):

/* includes alternating gray and white with on-hover color */

.mystyle {
    font-size: 11pt; 
    font-family: Arial;
    border-collapse: collapse; 
    border: 1px solid silver;

}

.mystyle td, .mystyle th {
    padding: 5px;
}

.mystyle tr:nth-child(even) {
    background: #E0E0E0;
}

.mystyle tr:hover {
    background: silver;
    cursor: pointer;
}

The python code:

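import os

import numpy as np
import pandas as pd
from weasyprint import HTML  # pip install weasyprint

# `folder` and `file_pdf` are assumed to be defined elsewhere (output directory and file name)
pdf_filepath = os.path.join(folder,file_pdf)
demo_df = pd.DataFrame(np.random.random((10,3)), columns = ("col 1", "col 2", "col 3"))

table=demo_df.to_html(classes='mystyle')

html_string = f'''
<html>
  <head>
    <title>HTML Pandas Dataframe with CSS</title>
    <link rel="stylesheet" type="text/css" href="df_style.css"/>
  </head>
  <body>
    {table}
  </body>
</html>
'''

# The stylesheets argument is what applies df_style.css to the rendered PDF
HTML(string=html_string).write_pdf(pdf_filepath, stylesheets=["df_style.css"])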

answered Feb 4, 2021 at 13:57

R_100


5 Comments

remo Over a year ago

What is HTML in the last line?

R_100 Over a year ago

The HTML is generated as a string in the python code. I'm not 100% sure what you meant by your question?

Vaibhav Rai Over a year ago

the HTML is imported from the 'weasyprint' module of python - pypi.org/project/weasyprint

hlongmore Over a year ago

Also note that if your system doesn't have a recent enough version of libpango, you can pin weasyprint==52.5 which does not depend on libpango>=1.44.0

Siddharth Das Over a year ago

For large size dataframe ( 40k rows), I am getting OOM error, any fix for that? @R_100

mit · Accepted Answer · 2018-10-09 13:56:27Z

5

This is a solution with an intermediate HTML file.

The table is pretty-printed with some minimal CSS.

The PDF conversion is done with weasyprint, which you need to install first: pip install weasyprint.

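# Note: to_html_pretty and the HTML templates are defined further down;
# define them before running this part.
# Create a pandas dataframe with demo data: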
import pandas as pd
demodata_csv = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(demodata_csv)

# Pretty print the dataframe as an html table to a file
intermediate_html = '/tmp/intermediate.html'
to_html_pretty(df,intermediate_html,'Iris Data')
# if you do not want pretty printing, just use pandas:
# df.to_html(intermediate_html)

# Convert the html file to a pdf file using weasyprint
import weasyprint
out_pdf= '/tmp/demo.pdf'
weasyprint.HTML(intermediate_html).write_pdf(out_pdf)

# This is the table pretty printer used above:

def to_html_pretty(df, filename='/tmp/out.html', title=''):
    '''
    Write an entire dataframe to an HTML file
    with nice formatting.
    Thanks to @stackoverflowuser2010 for the
    pretty printer see https://stackoverflow.com/a/47723330/362951
    '''
    ht = ''
    if title != '':
        ht += '<h2> %s </h2>\n' % title
    ht += df.to_html(classes='wide', escape=False)

    with open(filename, 'w') as f:
        f.write(HTML_TEMPLATE1 + ht + HTML_TEMPLATE2)

HTML_TEMPLATE1 = '''
<html>
<head>
<style>
  h2 {
    text-align: center;
    font-family: Helvetica, Arial, sans-serif;
  }
  table { 
    margin-left: auto;
    margin-right: auto;
  }
  table, th, td {
    border: 1px solid black;
    border-collapse: collapse;
  }
  th, td {
    padding: 5px;
    text-align: center;
    font-family: Helvetica, Arial, sans-serif;
    font-size: 90%;
  }
  table tbody tr:hover {
    background-color: #dddddd;
  }
  .wide {
    width: 90%; 
  }
</style>
</head>
<body>
'''

HTML_TEMPLATE2 = '''
</body>
</html>
'''

Thanks to @stackoverflowuser2010 for the pretty printer, see stackoverflowuser2010's answer https://stackoverflow.com/a/47723330/362951

I did not use pdfkit, because I had some problems with it on a headless machine. But weasyprint is great.

answered Oct 9, 2018 at 13:56

mit


2 Comments

TheDude Over a year ago

Do you know how I can force a page break? Say I have several table slices of a pandas dataframe and I want each table to start on a new page. Is that possible and at what point should I edit the html code?

Nikhil VJ Over a year ago

thanks! how to make it print with landscape orientation / different page size?
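Both questions above (page breaks and landscape output) can likely be handled in the CSS, since WeasyPrint honors `@page` rules and `page-break-before`. This is a hedged sketch rather than the answer author's method: `PAGE_CSS`, `frames_to_html` and `write_pdf` are hypothetical names, and the A4 size and 1cm margin are arbitrary choices.

```python
import pandas as pd

# Landscape A4 pages; every element with class "page-break" starts a new page.
PAGE_CSS = """
@page { size: A4 landscape; margin: 1cm; }
.page-break { page-break-before: always; }
"""


def frames_to_html(frames):
    """Build one HTML document with each DataFrame slice starting on its own page."""
    parts = []
    for i, df in enumerate(frames):
        cls = ' class="page-break"' if i > 0 else ""  # no break before the first table
        parts.append(f"<div{cls}>{df.to_html()}</div>")
    body = "\n".join(parts)
    return f"<html><head><style>{PAGE_CSS}</style></head><body>{body}</body></html>"


def write_pdf(html, out_path):
    """Render the HTML string with WeasyPrint (must be installed separately)."""
    import weasyprint  # imported lazily so the HTML step works without it
    weasyprint.HTML(string=html).write_pdf(out_path)


# Made-up demo data split into two slices.
df = pd.DataFrame({"a": range(6), "b": range(6, 12)})
html = frames_to_html([df.iloc[:3], df.iloc[3:]])
# write_pdf(html, "/tmp/paged.pdf")  # uncomment once weasyprint is installed
```

The same two CSS rules could instead be added to the `<style>` section of HTML_TEMPLATE1 if you keep the file-based flow.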

Lak · Accepted Answer · 2022-07-14 22:32:36Z

When using Matplotlib, here's how to get a prettier table (alternating row colors, etc.) and optionally paginate the PDF:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def _draw_as_table(df, pagesize):
    alternating_colors = [['white'] * len(df.columns), ['lightgray'] * len(df.columns)] * len(df)
    alternating_colors = alternating_colors[:len(df)]
    fig, ax = plt.subplots(figsize=pagesize)
    ax.axis('tight')
    ax.axis('off')
    the_table = ax.table(cellText=df.values,
                        rowLabels=df.index,
                        colLabels=df.columns,
                        rowColours=['lightblue']*len(df),
                        colColours=['lightblue']*len(df.columns),
                        cellColours=alternating_colors,
                        loc='center')
    return fig
  

def dataframe_to_pdf(df, filename, numpages=(1, 1), pagesize=(11, 8.5)):
  with PdfPages(filename) as pdf:
    nh, nv = numpages
    # Ceiling division so that leftover rows/columns land on the last page
    rows_per_page = -(-len(df) // nh)
    cols_per_page = -(-len(df.columns) // nv)
    for i in range(0, nh):
        for j in range(0, nv):
            page = df.iloc[(i*rows_per_page):min((i+1)*rows_per_page, len(df)),
                           (j*cols_per_page):min((j+1)*cols_per_page, len(df.columns))]
            fig = _draw_as_table(page, pagesize)
            if nh > 1 or nv > 1:
                # Add a part/page number at bottom-center of page
                fig.text(0.5, 0.5/pagesize[0],
                         "Part-{}x{}: Page-{}".format(i+1, j+1, i*nv + j + 1),
                         ha='center', fontsize=8)
            pdf.savefig(fig, bbox_inches='tight')
            
            plt.close()

Use it as follows:

dataframe_to_pdf(df, 'test_1.pdf')
dataframe_to_pdf(df, 'test_6.pdf', numpages=(3, 2))

Explanation of the code is here: https://levelup.gitconnected.com/how-to-write-a-pandas-dataframe-as-a-pdf-5cdf7d525488
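One way to reason about the page grid is to compute the row and column ranges up front, before any drawing. The sketch below is an illustration under assumptions, not code from the answer: `page_slices` is a hypothetical helper that uses ceiling division so a trailing remainder still lands on the last page, and it skips pages that would be empty.

```python
import math


def page_slices(n_rows, n_cols, numpages=(1, 1)):
    """Return (row_slice, col_slice) pairs covering an n_rows x n_cols table."""
    nh, nv = numpages
    rows_per_page = math.ceil(n_rows / nh)
    cols_per_page = math.ceil(n_cols / nv)
    slices = []
    for i in range(nh):
        for j in range(nv):
            rows = slice(i * rows_per_page, min((i + 1) * rows_per_page, n_rows))
            cols = slice(j * cols_per_page, min((j + 1) * cols_per_page, n_cols))
            if rows.start < rows.stop and cols.start < cols.stop:  # skip empty pages
                slices.append((rows, cols))
    return slices


grid = page_slices(10, 3, numpages=(3, 2))
```

Each pair can then be applied as `df.iloc[rows, cols]` to get the DataFrame for one page.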

ltspin · Accepted Answer · 2023-11-16 17:27:49Z

I found the answer by @Lak worked best for me. I particularly appreciated the multi-page option. To get consistent column widths across pages, I added column width calculations based on the maximum character length in each column (including the headers). The particular width scaling I used is what made my application look best; it will likely need some case-by-case tuning. I also added a flag for including the section/page numbering.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def _draw_as_table(df, pagesize,col_widths,idx_width):
    alternating_colors = [['white'] * len(df.columns), ['lightgray'] * len(df.columns)] * len(df)
    alternating_colors = alternating_colors[:len(df)]
    fig, ax = plt.subplots(figsize=pagesize)
    ax.axis('tight')
    ax.axis('off')
    the_table = ax.table(cellText=df.values,
                        rowLabels=df.index.map(('{: >'+str(idx_width+3)+'d}').format),
                        colLabels=df.columns,
                        rowColours=['lightblue']*len(df),
                        colColours=['lightblue']*len(df.columns),
                        cellColours=alternating_colors,
                        colWidths=col_widths,
                        fontsize=18, 
                        loc='center')
    the_table.auto_set_font_size(False)
    the_table.scale(1,1.5) #add a little row height padding
    #the_table.auto_set_column_width(col=list(range(len(df.columns))))
    return fig
  

def dataframe_to_pdf(df, filename, numpages=(1, 1), pagesize=(11, 8.5),pagenos = False):
  with PdfPages(filename) as pdf:
    nh, nv = numpages
    # Ceiling division so that leftover rows/columns land on the last page
    rows_per_page = -(-len(df) // nh)
    cols_per_page = -(-len(df.columns) // nv)
    col_widths = []
    for col in df.columns:
        col_widths += [df[col].astype('str').str.len().max()]
    header_widths = df.columns.str.len().to_numpy()
    col_widths = np.max(np.vstack((col_widths,header_widths)),axis=0) 
    col_widths = [max(x /sum(col_widths) * pagesize[0] *0.15,0.1) for x in col_widths] 
                # frac_len * page_width *scaler
    idx_width = df.index.astype('str').str.len().max()
    for i in range(0, nh):
        for j in range(0, nv):
            page = df.iloc[(i*rows_per_page):min((i+1)*rows_per_page, len(df)),
                           (j*cols_per_page):min((j+1)*cols_per_page, len(df.columns))]
            fig = _draw_as_table(page, pagesize, col_widths[j*cols_per_page:(j+1)*cols_per_page], idx_width)
            if (nh > 1 or nv > 1) and pagenos:
                # Add a part/page number at bottom-center of page
                fig.text(0.5, 0.5/pagesize[0],
                         "Part-{}x{}: Page-{}".format(i+1, j+1, i*nv + j + 1),
                         ha='center', fontsize=8)
            pdf.savefig(fig, bbox_inches='tight')            
            plt.close()
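The width calculation described above can be looked at in isolation. The following is a simplified sketch of the idea, not the exact scaling used in the answer: `relative_col_widths` is a hypothetical helper, and the `min_width` floor of 0.05 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd


def relative_col_widths(df, min_width=0.05):
    """Return one width per column, proportional to its longest cell or header text."""
    # Longest string representation of any cell in each column
    cell_len = df.astype(str).apply(lambda s: s.str.len().max())
    # Length of each column header
    head_len = np.array([len(str(c)) for c in df.columns])
    longest = np.maximum(cell_len.to_numpy(), head_len)
    fractions = longest / longest.sum()
    # Apply a floor so very short columns stay readable
    return [max(float(w), min_width) for w in fractions]


demo = pd.DataFrame({"a": ["x", "yy"], "long_header": ["1", "2"]})
widths = relative_col_widths(demo)
```

The resulting list can be passed as `colWidths` to `ax.table`; because of the floor, the widths may sum to slightly more than 1.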
