
I am currently working on the Dstl satellite Kaggle challenge, where I need to create a submission file in csv format. Each row in the csv contains:

Image ID, polygon class (1-10), Polygons

The polygons field is a single, very long entry: many coordinate sequences strung together, one polygon's start and end after another.

The polygons are created with an algorithm, for one class at a time, for one picture at a time (429 pictures, 10 classes each).

Now my question is related to computation time and best practice: How do I best write the data of the polygons that I create into the csv? Do I open the csv at the beginning and then write each row into the file, as I iterate over classes and images?

Or should I rather save the data in a list or dictionary or something and then write the whole thing into the csv file at once?

The thing is, I am not sure how fast writing to a csv file is. Also, since the algorithm is already rather expensive computationally, I would like to spare my PC the trouble of keeping all the data in RAM.

And I guess writing the data into the csv right away would result in less RAM used, right?

So you say that disk operations are slow. What exactly does that mean? When I write each row to the csv live, as I create the data, does that slow down my program? Would writing a whole list to a csv file be faster than writing a row, then calculating the next data row, and so on? That would mean the computer waits for one action to finish before the next one starts, right? But then, what makes the process faster if I wait for all the data to accumulate? The same number of rows has to be written to the csv either way, so why would it be slower line by line?

2 Answers


How do I best write the data of the polygons that I create into the csv? Do I open the csv at the beginning and then write each row into the file, as I iterate over classes and images?

I suspect most folks would gather the data in a list or perhaps dictionary and then write it all out at the end. But if you don't need to do additional processing to it, yeah -- send it to disk and release the resources.
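For illustration, a minimal Python 3 sketch of that streaming approach, with a placeholder make_polygons standing in for the real algorithm and column names assumed from the Dstl sample submission (check them against the competition's sample file):

    import csv

    # Placeholder for the real polygon-extraction algorithm.
    def make_polygons(image_id, class_id):
        return "MULTIPOLYGON EMPTY"  # valid "empty" WKT, stands in for real output

    image_ids = ["6120_2_2", "6120_2_3"]  # the real run has 429 IDs

    # Open the file once, then stream each row out as soon as it is computed,
    # so at most one row's polygon string is held in RAM at a time.
    with open("submission.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ImageId", "ClassType", "MultipolygonWKT"])
        for image_id in image_ids:
            for class_id in range(1, 11):
                writer.writerow([image_id, class_id, make_polygons(image_id, class_id)])

The with block closes the file (and flushes any buffered rows) automatically, even if the algorithm raises an exception partway through.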

And I guess writing the data into the csv right away would result in less RAM used, right?

Yes, it would, but it's not going to affect CPU usage; it just reduces RAM usage (exactly how much depends on when Python garbage-collects the objects). You really shouldn't worry about details like this, though. Get accurate output, first and foremost.




First of all, use the csv library. Documentation: https://docs.python.org/2/library/csv.html (Python 2) or https://docs.python.org/3/library/csv.html (Python 3).

Now, using this library, you can take a list of list-like objects, or a list of dicts where the keys are the headers of your csv, and write them to a file. This is almost certainly the right way to go. If you have so much data that you've maxed out the memory of the Python process, you might want to go back and rethink it, but with 429 * 10 = 4290 rows, that's probably not happening.
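For example, a sketch of that batch approach with csv.DictWriter, again with a placeholder make_polygons and assumed Dstl column names:

    import csv

    def make_polygons(image_id, class_id):
        return "MULTIPOLYGON EMPTY"  # placeholder for the real algorithm

    image_ids = ["6120_2_2", "6120_2_3"]  # 429 IDs in the real run

    # Build every row in memory first ...
    rows = [
        {"ImageId": img, "ClassType": cls,
         "MultipolygonWKT": make_polygons(img, cls)}
        for img in image_ids
        for cls in range(1, 11)
    ]

    # ... then write the whole batch in one call.
    with open("submission.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ImageId", "ClassType", "MultipolygonWKT"])
        writer.writeheader()
        writer.writerows(rows)

Keeping the rows around also makes it easy to inspect or post-process them before writing.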

And I guess writing the data into the csv right away would result in less RAM used, right?

Disk access is, in general, a relatively slow operation, so an approach that increases disk access in order to save memory usage is questionable.
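If you want a feel for the scale involved, here is a rough Python 3 timing sketch (dummy rows, a hypothetical bench.csv) comparing row-by-row and batched writes; note that Python's file objects buffer output, so individual writerow calls do not each hit the physical disk:

    import csv
    import time

    # 4290 dummy rows, roughly the size of the actual submission.
    rows = [["6120_2_2", cls, "MULTIPOLYGON EMPTY"]
            for _ in range(429) for cls in range(1, 11)]

    start = time.perf_counter()
    with open("bench.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:               # one writerow() call per row
            writer.writerow(row)
    print(f"row by row: {time.perf_counter() - start:.4f} s")

    start = time.perf_counter()
    with open("bench.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)  # one batched call
    print(f"batched:    {time.perf_counter() - start:.4f} s")

At a few thousand rows, both should finish in well under a second, so either style is fine here.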


