I have a large dataset that I run experiments on. It takes 30 minutes to load the dataset from file into memory using a Python program. Then I run variations of an algorithm on the dataset. Each time I vary the algorithm, I have to load the dataset into memory again, which eats up another 30 minutes.

Is there any way to load the dataset into memory once and for all, and then, each time I run a variation of the algorithm, just use that pre-loaded dataset?

I know the question is a bit abstract; suggestions to improve the framing of the question are welcome. Thanks.

EDITS:

It's a text file containing graph data, around 6 GB. If I only load a portion of the dataset, it doesn't make for a very good graph. I do no computation while loading the dataset.

  • You could possibly try using a ram disk, or SSD. Not an answer to your question, sorry... Commented Dec 5, 2013 at 0:53
  • How are you loading your data set? Is it a .csv file, a database, or what? Do you perform computations during load or is it simply reading from disk for 30 minutes? Commented Dec 5, 2013 at 0:53
  • I would suggest that, until you have finalized your algorithm, you work with a much smaller sample of the full data (if viable). Commented Dec 5, 2013 at 0:58
  • What kind of file is it? How much data do you have? It seems incredible to me that if you're just reading from disk that it could take 30 minutes to load the data and not run you out of memory. Are you doing processing on the data as you read it? Commented Dec 5, 2013 at 0:59
  • @mgilson I have 32 GB of memory, and do a hash table lookup around 10-20 times for each line I pick up from the file before appending it to a python list. Commented Dec 5, 2013 at 1:43
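For context, the load loop described in that last comment might look roughly like this (the file path, field format, and `lookup` table are hypothetical; the real keys depend on the file format):

```python
def load_graph(path, lookup):
    """Sketch of the load described above: for each line, a handful of
    hash-table lookups, then append to a Python list. The real code does
    10-20 lookups per line; one per field is shown here."""
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # resolve each field through the lookup table, falling back
            # to the raw token when it is not in the table
            resolved = [lookup.get(tok, tok) for tok in fields]
            rows.append(resolved)
    return rows
```

With 32 GB of RAM and a 6 GB file, the data fits in memory; the cost is purely the repeated loading, which is what the answers below try to avoid.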

3 Answers

You could write a very quick CLI which would load the data once, then repeatedly ask for a Python filename, which it would then eval() against the in-memory data...
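A minimal sketch of that idea, using `runpy.run_path` instead of `eval()` (since `eval()` only handles single expressions, not whole script files). `load_dataset` and the `result` variable convention are hypothetical placeholders:

```python
import runpy

def run_variant(script_path, data):
    """Execute a Python script with the pre-loaded dataset injected into
    its globals as `data`, and return the script's final globals."""
    return runpy.run_path(script_path, init_globals={"data": data})

# Sketch of the CLI loop (load_dataset is a placeholder for your loader):
#
#     data = load_dataset()          # the 30-minute load happens once, here
#     while True:
#         path = input("script to run (blank to quit): ").strip()
#         if not path:
#             break
#         print(run_variant(path, data).get("result"))
```

Each algorithm variant then lives in its own small script that reads `data` and leaves its output in `result`; only the script re-executes, never the load.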


You could use an environment such as Spyder, which is similar to Matlab. It even lets you see a list of all variables in the workspace at any time during algorithm execution.


One possible solution is to use Jupyter: load the dataset once and keep the Jupyter session running. Then modify your algorithm in a cell and rerun only that cell. You can operate on the loaded dataset in RAM as much as you want until you terminate the session.

