
I'm working on a Python project that involves processing large CSV files (2–5 GB in size). The script reads the CSV file, performs data transformations, and writes the output to a new file. However, it's running very slowly and consuming a lot of memory.

Here is the current approach I'm using:

import csv

with open('large_file.csv', 'r', newline='') as infile, open('output_file.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Perform some transformation (e.g., clean data, filter rows)
        if int(row[2]) > 1000:  # Example filter condition
            writer.writerow(row)

Issues:

The script takes several hours to complete. It consumes a lot of memory, which sometimes causes crashes on my machine with 8 GB of RAM.

I've tried reading the file in chunks using pandas:

import pandas as pd

chunks = pd.read_csv('large_file.csv', chunksize=10000)
for i, chunk in enumerate(chunks):
    # Transformation logic here (placeholder: the same example filter as above)
    filtered = chunk[chunk.iloc[:, 2] > 1000]
    filtered.to_csv('output_file.csv', mode='w' if i == 0 else 'a',
                    header=(i == 0), index=False)

This improved the memory usage but didn't make a significant difference in speed.

I also experimented with csv.DictReader for more readable transformations, but the performance was the same.
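
For reference, a minimal sketch of that DictReader variant (the column name amount stands in for the real header in my file):

import csv

with open('large_file.csv', 'r', newline='') as infile, \
     open('output_file.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)                                # keys come from the header row
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if int(row['amount']) > 1000:                              # 'amount' stands in for the real column
            writer.writerow(row)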

My question: How can I optimize this script to process the CSV file more efficiently in terms of both speed and memory usage? Are there Python libraries or techniques specifically designed for handling such large datasets?

5 Comments

  • Have you read stackoverflow.com/questions/37732466/… Commented Dec 16, 2024 at 9:54
  • Have you tried playing around with a larger/different buffer size for the output file? E.g. open('output_file.csv', 'w', newline='', buffering=65536) Commented Dec 16, 2024 at 9:54
  • Just to reiterate: are you sure that this exact code also causes these problems? There is a possibility that the data transformations are the issue. Commented Dec 16, 2024 at 10:00
  • This code as-is should only store a single row in memory at any one time, and thus use virtually zero memory; it certainly shouldn't increase. Commented Dec 16, 2024 at 10:07
  • @LiorDahan "I've checked the data transformations, and they're fine" I don't believe you. I never believe anyone. Please post some code that actually causes the trouble. The csv module is well optimized and should have no problem handling GBs of data. Commented Dec 16, 2024 at 10:10

2 Answers


The general principle for optimization is to identify the bottleneck and then optimize it.

It's difficult to say exactly what the issue is in your case because we can't see what transformations are being done. Any inefficient transformation will have a detrimental performance impact because it runs repeatedly, once for each row.

In terms of Python libraries or techniques designed to handle large datasets, there's a ton. Parallel computing is a whole field of computer science. You can look into Apache Spark, which can distribute the processing of large datasets across a cluster of machines so the work runs in parallel.
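
If you go that route, here is a minimal PySpark sketch of the same example filter from the question; the paths and the auto-generated column name _c2 (Spark's default name for the third column of a header-less CSV) are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('csv-filter').getOrCreate()

# Read the CSV; inferSchema makes the numeric columns numeric instead of strings
df = spark.read.csv('large_file.csv', header=False, inferSchema=True)

# Same example filter as in the question; the output is written as a directory of part files
df.filter(col('_c2') > 1000).write.csv('output_dir', mode='overwrite')

spark.stop()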


3 Comments

Thanks! I’ve checked the transformations and think the problem is with the code. I’ll look into using Apache Spark or parallel computing to improve performance
He has trouble processing a CSV file and you recommend Apache Spark?
I'm answering his question, "Are there Python libraries or techniques specifically designed for handling such large datasets?" Furthermore, he doesn't have trouble processing the file; the issue is the speed at which it is processed.

It's hard to tell what can be done without knowing more details about the data transformations you're performing.

The csv module's parsing is done in C, but it still hands you every row as a plain Python list of strings, whereas pandas.read_csv defaults to a C engine [0] that parses directly into typed arrays and should be more performant. However, the fact that performance isn't improved in the pandas case suggests that you're not bottlenecked by I/O but by your computations; profiling your code would confirm that.
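
For example, a quick way to see where the time actually goes with the standard library's cProfile (process_file is a hypothetical wrapper around your read-transform-write loop):

import cProfile
import pstats

# Profile one full run and dump the raw statistics to a file
cProfile.run("process_file('large_file.csv', 'output_file.csv')", 'profile.stats')

# Show the 20 entries with the largest cumulative time
pstats.Stats('profile.stats').sort_stats('cumulative').print_stats(20)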

In general, during your data manipulations, pandas should be a lot more memory-efficient than csv, because instead of storing your rows as Python objects, the data is stored in typed arrays behind the scenes.

The general principle for performant in-memory data manipulation for Python is to vectorize, vectorize, vectorize. Every Python for loop, function call, and variable binding incurs overhead. If you can push these operations into libraries that convert them to machine code, you will see significant performance improvements. This means avoiding direct indexing of cells in your DataFrames, and, most importantly, avoiding tight for loops at all costs [1]. The problem is this is not always possible or convenient when you have dependencies between computations on different rows [2].
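
To illustrate with the question's own filter (assuming a header-less file, so the columns are numbered):

import pandas as pd

df = pd.read_csv('large_file.csv', header=None)

# Slow: a Python-level loop that materializes every row as a Series
kept = [row for _, row in df.iterrows() if row[2] > 1000]

# Fast: a vectorized boolean mask evaluated in compiled code in one pass
kept_df = df[df[2] > 1000]
kept_df.to_csv('output_file.csv', header=False, index=False)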

You might also want to have a look at polars. It is a dataframe library like pandas, but has a different, declarative API. polars also has a streaming API with scan_csv and sink_csv, which may be what you want here. The caveat is this streaming API is experimental and not yet very well documented.
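
A minimal sketch of that streaming approach, reusing the question's example filter and assuming a header-less file (polars then names the third column column_3):

import polars as pl

(
    pl.scan_csv('large_file.csv', has_header=False)   # lazy: nothing is loaded yet
      .filter(pl.col('column_3') > 1000)              # same example filter as the question
      .sink_csv('output_file.csv')                    # streams the result to disk in batches
)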

For a 2–5 GB CSV file on an 8 GB machine, though, I think you should be able to load the whole thing into memory, especially since the overhead of the CSV text format shrinks once the data is converted to an in-memory columnar format such as Arrow.


[0] In some cases pandas falls back to pure Python for reading, so you might want to make sure you're not falling into one of those cases.

[1] I find this is usually where Python performance issues are most apparent, because such computations run for every single row in your dataset. The first question with performance is often not "how do I make my operations quicker?" but "what am I doing a lot of?". That's why profiling is so important.

[2] You can sometimes get around this with judicious use of shifting.
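
For example, a computation that needs the previous row's value (delta is a hypothetical derived column) can be written without an explicit loop:

import pandas as pd

df = pd.DataFrame({'value': [1, 3, 2, 8, 5]})

# Difference from the previous row, computed column-wise instead of row by row
df['delta'] = df['value'] - df['value'].shift(1)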

2 Comments

Ok, I will try to understand what you said. Thanks so much for the effort.
"It's hard to tell what can be done without knowing more details": if the question is vague, your answer will be vague too; they're both useless for the community.
