I'm working on a Python project that involves processing large CSV files (2–5 GB in size). The script reads the CSV file, performs data transformations, and writes the output to a new file. However, it's running very slowly and consuming a lot of memory.
Here is the current approach I'm using:
import csv

with open('large_file.csv', 'r') as infile, open('output_file.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Perform some transformation (e.g., clean data, filter rows)
        if int(row[2]) > 1000:  # Example filter condition
            writer.writerow(row)
Issues:
- The script takes several hours to complete.
- It consumes a lot of memory, which sometimes causes crashes on my machine with 8 GB of RAM.
What I've tried:
Reading the file in chunks using pandas:
import pandas as pd

chunks = pd.read_csv('large_file.csv', chunksize=10000)
for chunk in chunks:
    # Transformation logic here
This improved the memory usage but didn't make a significant difference in speed.
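Roughly, the full chunked loop looked like the sketch below ('value' is just a stand-in for the real column name, and the filter is the same toy example as in the csv version; the actual transformation is more involved):

import pandas as pd

chunks = pd.read_csv('large_file.csv', chunksize=10000)
for i, chunk in enumerate(chunks):
    # Same example filter as above; 'value' is a placeholder column name
    filtered = chunk[chunk['value'] > 1000]
    # Write the first chunk with a header, then append the rest without one
    filtered.to_csv('output_file.csv',
                    mode='w' if i == 0 else 'a',
                    header=(i == 0),
                    index=False)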
I also experimented with csv.DictReader for more readable transformations, but the performance was the same.
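The DictReader variant was essentially this (again, 'value' is a placeholder for the real field name):

import csv

with open('large_file.csv', 'r', newline='') as infile, \
     open('output_file.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    # DictWriter needs the field names up front; reuse the ones DictReader found
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Same example filter, but by column name instead of position
        if int(row['value']) > 1000:
            writer.writerow(row)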
My question: How can I optimize this script to process the CSV file more efficiently in terms of both speed and memory usage? Are there Python libraries or techniques specifically designed for handling such large datasets?
One suggestion I came across was to increase the write buffer, e.g. open('output_file.csv', 'w', newline='', buffering=65536), since the csv module itself is apparently well optimized and should have no problem handling GBs of data.
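Concretely, that only changes the open() calls in the streaming version above; a minimal sketch of what I understood the suggestion to mean:

import csv

# Larger buffers mean fewer OS-level read/write calls; 64 KiB is just a guess
with open('large_file.csv', 'r', newline='', buffering=65536) as infile, \
     open('output_file.csv', 'w', newline='', buffering=65536) as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if int(row[2]) > 1000:  # same example filter as above
            writer.writerow(row)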