Can anyone provide an example of how to create a zip file from a CSV file using Python and the pandas package? Thank you.
4 Answers
Use
df.to_csv('my_file.gz', compression='gzip')
From the docs:
compression : string, optional a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
See discussion of support of zip files here.
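Note that the call above produces a gzip stream, not a ZIP archive. A minimal sketch that checks this and reads the file back is shown here; the sample DataFrame and the temporary directory are assumptions made only for illustration.

```python
import gzip
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "my_file.gz"
    df.to_csv(gz_path, index=False, compression="gzip")

    # A .gz file is a single compressed stream, so zipfile does not
    # recognise it as a ZIP archive.
    is_zip = zipfile.is_zipfile(gz_path)

    # Decompress by hand to show that the file holds plain CSV text.
    with gzip.open(gz_path, "rt", newline="") as fh:
        csv_text = fh.read()

    # pandas can read the compressed file back directly.
    df_back = pd.read_csv(gz_path, compression="gzip")
```

If an actual .zip file is required, gzip output will not open in tools that expect ZIP archives, so one of the zip-based approaches is the better fit.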
In pandas' to_csv() method, besides the compression type (gzip, zip, etc.), you can also specify the name of the file stored inside the archive. Just pass a dict with the necessary parameters as the compression argument:
compression_opts = dict(method='zip',
                        archive_name='out.csv')
df.to_csv('out.zip', compression=compression_opts)
In the example above, the first argument of to_csv() is the name of the ZIP archive file, the method key of the dict selects ZIP compression, and the archive_name key sets the name of the CSV file inside the archive.
Result:
├─ out.zip
│ └─ out.csv
See details in to_csv() pandas docs
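To confirm the layout shown above, the archive can be inspected with the standard zipfile module. This is only a sketch: the sample DataFrame and the temporary directory are assumptions, and index=False is added so that the round trip returns the same columns.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "out.zip"

    compression_opts = dict(method="zip", archive_name="out.csv")
    df.to_csv(zip_path, index=False, compression=compression_opts)

    # List the members stored in the archive.
    with zipfile.ZipFile(zip_path) as zf:
        member_names = zf.namelist()

    # read_csv infers zip compression from the .zip extension; the
    # archive must contain exactly one file for this to work.
    df_back = pd.read_csv(zip_path)
```

After running this, member_names holds the single entry out.csv, matching the archive_name passed in the dict.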
The pandas to_csv compression can be a security concern: on Linux machines it can leave the absolute path of the file inside the zip archive. You may also simply want the CSV stored at the top level of the zipped file. The following function addresses both points by using zipfile directly. On top of that, since it stores plain CSV rather than a pickle, it doesn't suffer from the pickle protocol change (4 to 5).
from pathlib import Path
import zipfile


def save_compressed_df(df, dirPath, fileName):
    """Save a Pandas dataframe as a zipped .csv file.

    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        Input dataframe.
    dirPath : str or pathlib.PosixPath
        Parent directory of the zipped file.
    fileName : str
        File name without extension.
    """
    dirPath = Path(dirPath)
    path_zip = dirPath / f'{fileName}.csv.zip'
    txt = df.to_csv(index=False)
    with zipfile.ZipFile(path_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f'{fileName}.csv', txt)
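A short usage sketch for the function above follows. The function body is repeated so the example runs on its own; the sample DataFrame, the name "data", and the temporary directory are assumptions for illustration only.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd


def save_compressed_df(df, dirPath, fileName):
    """Copy of the function above, repeated so this sketch is self-contained."""
    path_zip = Path(dirPath) / f"{fileName}.csv.zip"
    txt = df.to_csv(index=False)
    with zipfile.ZipFile(path_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{fileName}.csv", txt)


# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"x": [1.5, 2.5], "y": ["u", "v"]})

with tempfile.TemporaryDirectory() as tmp:
    save_compressed_df(df, tmp, "data")
    path_zip = Path(tmp) / "data.csv.zip"

    # The archive member is stored under a bare file name,
    # with no directory component.
    with zipfile.ZipFile(path_zip) as zf:
        member_names = zf.namelist()

    # read_csv infers zip compression from the .zip suffix.
    df_back = pd.read_csv(path_zip)
```

Because writestr() is given only the file name, no part of the directory path ends up inside the archive.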
The compression argument accepts a dictionary, which can specify the archive_name used inside the zip archive. This works only for zip archives, though; for gzip you have to use df.to_csv("/tmp/df.csv.gz", compression="gzip").