
I'm working on a Python project that involves processing large CSV files (2–5 GB in size). The script reads the CSV file, performs data transformations, and writes the output to a new file. However, it's running very slowly and consuming a lot of memory.

Here is the current approach I'm using:

import csv

with open('large_file.csv', 'r', newline='') as infile, open('output_file.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Perform some transformation (e.g., clean data, filter rows)
        if int(row[2]) > 1000:  # Example filter condition
            writer.writerow(row)

Issues:

The script takes several hours to complete. It consumes a lot of memory, which sometimes causes crashes on my machine with 8 GB of RAM.

I've tried reading the file in chunks using pandas:

import pandas as pd

chunks = pd.read_csv('large_file.csv', chunksize=10000)
for i, chunk in enumerate(chunks):
    # Transformation logic here (placeholder: the same example filter as above)
    filtered = chunk[chunk.iloc[:, 2] > 1000]
    filtered.to_csv('output_file.csv', mode='w' if i == 0 else 'a',
                    header=(i == 0), index=False)

This improved the memory usage but didn't make a significant difference in speed.

I also experimented with csv.DictReader for more readable transformations, but the performance was the same.
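
For reference, a minimal sketch of that DictReader variant (the column name amount stands in for the real header in my file):

import csv

with open('large_file.csv', 'r', newline='') as infile, \
     open('output_file.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)                                # keys come from the header row
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if int(row['amount']) > 1000:                              # 'amount' stands in for the real column
            writer.writerow(row)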

My question: How can I optimize this script to process the CSV file more efficiently in terms of both speed and memory usage? Are there Python libraries or techniques specifically designed for handling such large datasets?

5 Comments

  • Have you read stackoverflow.com/questions/37732466/… Commented Dec 16, 2024 at 9:54
  • Have you tried playing around with a larger/different buffer size for the output file? E.g. open('output_file.csv', 'w', newline='', buffering=65536) Commented Dec 16, 2024 at 9:54
  • Just to reiterate: are you sure that this exact code also causes these problems? There is a possibility that the data transformations are the issue. Commented Dec 16, 2024 at 10:00
  • This code as-is should only store a single row in memory at any one time, and thus use virtually zero memory; it certainly shouldn't increase. Commented Dec 16, 2024 at 10:07
  • @LiorDahan "I've checked the data transformations, and they're fine" I don't believe you. I never believe anyone. Please post some code that actually causes the trouble. The csv module is well optimized and should have no problem handling GBs of data. Commented Dec 16, 2024 at 10:10

2 Answers


The general principle for optimization is to identify the bottleneck and then optimize it.

It's difficult to say exactly what the issue is in your case because we can't see what transformations are being done. Any inefficient transformation will have a detrimental performance impact because it runs repeatedly, once for each row.

In terms of Python libraries or techniques designed to handle large datasets, there's a ton. Parallel computing is a whole field of computer science. You can look into Apache Spark, which can distribute the processing of large datasets across a cluster of machines so the work runs in parallel.
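
If you go that route, here is a minimal PySpark sketch of the same example filter from the question; the paths and the auto-generated column name _c2 (Spark's default name for the third column of a header-less CSV) are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('csv-filter').getOrCreate()

# Read the CSV; inferSchema makes the numeric columns numeric instead of strings
df = spark.read.csv('large_file.csv', header=False, inferSchema=True)

# Same example filter as in the question; the output is written as a directory of part files
df.filter(col('_c2') > 1000).write.csv('output_dir', mode='overwrite')

spark.stop()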


3 Comments

Thanks! I’ve checked the transformations and think the problem is with the code. I’ll look into using Apache Spark or parallel computing to improve performance
He has trouble processing a CSV file and you recommend Apache Spark?
I'm answering his question, "Are there Python libraries or techniques specifically designed for handling such large datasets?" Furthermore, he doesn't have trouble processing the file; the issue is the speed at which it is processed.

It's hard to tell what can be done without knowing more details about the data transformations you're performing.

The csv module's parsing is done in C, but it still hands you every row as a plain Python list of strings, whereas pandas.read_csv defaults to a C engine [0] that parses directly into typed arrays and should be more performant. However, the fact that performance isn't improved in the pandas case suggests that you're not bottlenecked by I/O but by your computations; profiling your code would confirm that.
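
For example, a quick way to see where the time actually goes with the standard library's cProfile (process_file is a hypothetical wrapper around your read-transform-write loop):

import cProfile
import pstats

# Profile one full run and dump the raw statistics to a file
cProfile.run("process_file('large_file.csv', 'output_file.csv')", 'profile.stats')

# Show the 20 entries with the largest cumulative time
pstats.Stats('profile.stats').sort_stats('cumulative').print_stats(20)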

In general, during your data manipulations, pandas should be a lot more memory-efficient than csv, because instead of storing your rows as Python objects, the data is stored in typed arrays behind the scenes.

The general principle for performant in-memory data manipulation for Python is to vectorize, vectorize, vectorize. Every Python for loop, function call, and variable binding incurs overhead. If you can push these operations into libraries that convert them to machine code, you will see significant performance improvements. This means avoiding direct indexing of cells in your DataFrames, and, most importantly, avoiding tight for loops at all costs [1]. The problem is this is not always possible or convenient when you have dependencies between computations on different rows [2].
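
To illustrate with the question's own filter (assuming a header-less file, so the columns are numbered):

import pandas as pd

df = pd.read_csv('large_file.csv', header=None)

# Slow: a Python-level loop that materializes every row as a Series
kept = [row for _, row in df.iterrows() if row[2] > 1000]

# Fast: a vectorized boolean mask evaluated in compiled code in one pass
kept_df = df[df[2] > 1000]
kept_df.to_csv('output_file.csv', header=False, index=False)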

You might also want to have a look at polars. It is a dataframe library like pandas, but has a different, declarative API. polars also has a streaming API with scan_csv and sink_csv, which may be what you want here. The caveat is this streaming API is experimental and not yet very well documented.
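
A minimal sketch of that streaming approach, reusing the question's example filter and assuming a header-less file (polars then names the third column column_3):

import polars as pl

(
    pl.scan_csv('large_file.csv', has_header=False)   # lazy: nothing is loaded yet
      .filter(pl.col('column_3') > 1000)              # same example filter as the question
      .sink_csv('output_file.csv')                    # streams the result to disk in batches
)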

For a 2–5 GB CSV file on an 8 GB machine, though, I think you should be able to load the whole thing into memory, especially since the overhead of the CSV text format shrinks once the data is converted to an in-memory columnar format such as Arrow.


[0] In some cases pandas falls back to pure Python for reading, so you might want to make sure you're not falling into one of those cases.

[1] I find this is usually where Python performance issues are most apparent, because such computations run for every single row in your dataset. The first question with performance is often not "how do I make my operations quicker?" but "what am I doing a lot of?". That's why profiling is so important.

[2] You can sometimes get around this with judicious use of shifting.
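
For example, a computation that needs the previous row's value (delta is a hypothetical derived column) can be written without an explicit loop:

import pandas as pd

df = pd.DataFrame({'value': [1, 3, 2, 8, 5]})

# Difference from the previous row, computed column-wise instead of row by row
df['delta'] = df['value'] - df['value'].shift(1)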

2 Comments

Ok, I will try to understand what you said. Thanks so much for the effort.
"It's hard to tell what can be done without knowing more details": if the question is vague, your answer will be vague too; they're both useless for the community.
