
I have an array with ~1,000,000 rows, each of which is a numpy array of 4,800 float32 numbers. I need to save this as a CSV file, but numpy.savetxt has been running for 30 minutes and I don't know how much longer it will go. Is there a faster method of saving the large array as a CSV? Many thanks, Josh

  • Have you tried using pandas? Commented Aug 27, 2019 at 17:50
  • 1 million rows, each 4800 × 4 B = 19,200 B, means ~20 GB of data IF packed in binary form! You are outputting as ASCII, so that's probably twice as much data. I don't know how much time it could take, but 30 minutes seems reasonable... maybe even an hour on a slow disk. Commented Aug 27, 2019 at 17:51
  • I'll give it some more time and then try pandas. Any idea how much faster pandas might be? It should write to the file quicker, but it does need to convert the data set into the pandas format first, which is very memory intensive. I'm using a Dell XPS 9550, which isn't the slowest computer. Commented Aug 27, 2019 at 17:53
  • I would strongly advise you to use stream compression and save that thing compressed in binary form. I don't know how much RAM your PC has, but that data is probably taking a big portion of it (which makes things even slower). Commented Aug 27, 2019 at 17:57
  • I would simply save it as binary and then write a conversion script from binary to ascii. You can let it run overnight and not have to worry about the bottleneck Commented Aug 27, 2019 at 19:08

1 Answer


As pointed out in the comments, 1e6 rows * 4800 columns * 4 bytes per float32 is ~18 GiB. Writing a float to text takes ~9 bytes of text (estimating 1 for the integer part, 1 for the decimal point, 5 for the mantissa and 2 for the separator), which comes out to ~40 GiB. This takes a long time to do, since just the conversion to text itself is non-trivial, and disk I/O will be a huge bottleneck.

One way to optimize this process may be to convert the entire array to text on your own terms, and write it in blocks using Python's binary I/O. I doubt that will give you much benefit, though.
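As a rough sketch of that idea (the array name and the per-block formatting are my own choices, not from your code), you can format a block of rows into one big string and push it through a binary file handle in a single write:

```python
import numpy as np

# Small stand-in for your real (1_000_000, 4800) array
rng = np.random.default_rng(0)
data = rng.random((1000, 48)).astype(np.float32)

chunk = 100  # rows per block; tune for your memory budget
with open("out.csv", "wb") as f:
    for start in range(0, data.shape[0], chunk):
        block = data[start:start + chunk]
        # Build one text buffer per block instead of one write per value
        lines = "\n".join(",".join(f"{x:.6g}" for x in row) for row in block)
        f.write(lines.encode("ascii") + b"\n")
```

The win, if any, comes from batching the writes and skipping `savetxt`'s per-row format handling; the text conversion itself still dominates.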

A much better solution would be to write the binary data to a file instead of text. Aside from the obvious advantages of space and speed, binary has the advantage of being searchable and not requiring transformation after loading. You know where every individual element is in the file, so if you are clever, you can access portions of the file without loading the entire thing. Finally, a binary file is more likely to be highly compressible than a relatively low-entropy text file.
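For instance (file names here are placeholders), a raw dump with `ndarray.tofile`, or `np.save`, which also records the shape and dtype in the `.npy` header:

```python
import numpy as np

# Stand-in for the real array
rng = np.random.default_rng(0)
data = rng.random((1000, 48)).astype(np.float32)

data.tofile("data.bin")    # raw bytes only: you must remember shape/dtype yourself
np.save("data.npy", data)  # .npy header stores shape and dtype for you

loaded = np.load("data.npy")
raw = np.fromfile("data.bin", dtype=np.float32).reshape(1000, 48)
```

Both writes are essentially a straight memory-to-disk copy, so they run at disk speed with no per-element conversion.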

Disadvantages of binary are that it is not human-readable, and not as portable as text. The latter is not a problem, since transforming into an acceptable format will be trivial. The former is likely a non-issue given the amount of data you are attempting to process anyway.

Keep in mind that human readability is a relative term. A human cannot read 40 GiB of numerical data with understanding. A human can process A) a graphical representation of the data, or B) scan through relatively small portions of the data. Both cases are suitable for binary representations. Case A) is straightforward: load, transform and plot the data. This will be much faster if the data is already in a binary format that you can pass directly to the analysis and plotting routines. Case B) can be handled with something like a memory-mapped file. You only ever need to load a small portion of the file, since you can't really show more than say a thousand elements on screen at one time anyway. Any reasonably modern platform should be able to keep up with the I/O and binary-to-text conversion associated with a user scrolling around a table widget or similar. In fact, binary makes it easier since you know exactly where each element belongs in the file.
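A sketch of case B), assuming the data was written with `np.save` as described above: `mmap_mode='r'` maps the file rather than reading it, so slicing only touches the pages a viewer actually needs.

```python
import numpy as np

# Stand-in for the real array, written once in binary form
rng = np.random.default_rng(0)
data = rng.random((1000, 48)).astype(np.float32)
np.save("data.npy", data)

# Map the file instead of loading 18+ GiB into RAM
view = np.load("data.npy", mmap_mode="r")

# e.g. the ten rows a table widget is currently displaying
window = view[500:510]
```

Only the pages backing those ten rows are faulted in; the rest of the file never leaves disk.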


3 Comments

The problem is that ideally the file needs to be human-readable. Using numpy.savetxt, I assume the bottleneck will be the read/write capabilities of my laptop?
@JTovell. Using anything to dump 20 or more GB to disk will make I/O the bottleneck.
@JTovell. I've added a paragraph that hopefully addresses your concern
