
I have a large dataset of events in a Postgres database that is too large to analyze in memory. Therefore I would like to quantize the datetimes to a regular interval and perform group by operations within the database prior to returning results. I thought I would use SqlSoup to iterate through the records in the appropriate table and make the necessary transformations. Unfortunately I can't figure out how to perform the iteration in such a way that I'm not loading references to every record into memory at once. Is there some way of getting one record reference at a time in order to access the data and update each record as needed?
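For concreteness, here is a rough sketch of the two things I'm trying to do, written against plain SQLAlchemy (which SqlSoup wraps) since I know how to express it there. The table, columns, and connection string are all made up, and I haven't verified the streaming part — it's my best reading of the docs:

```python
# A rough sketch, untested -- the table name (events), column name
# (occurred_at), and connection string are placeholders for my real schema.
from sqlalchemy import MetaData, Table, create_engine, func, select

engine = create_engine("postgresql://user:pass@localhost/mydb")
events = Table("events", MetaData(), autoload_with=engine)

# Plan A: quantize and aggregate inside Postgres, so only one row per
# time bucket ever crosses the wire. date_trunc() snaps each timestamp
# to the start of its hour.
bucket = func.date_trunc("hour", events.c.occurred_at)
stmt = select(bucket.label("bucket"), func.count().label("n")).group_by(bucket)

with engine.connect() as conn:
    for row in conn.execute(stmt):
        print(row.bucket, row.n)

# Plan B: if each row really has to be visited in Python, stream the
# result with a server-side cursor (stream_results) so psycopg2 doesn't
# buffer the entire table client-side. This is the part I'm unsure about.
with engine.connect().execution_options(stream_results=True) as conn:
    for row in conn.execute(select(events)):
        ...  # inspect/update each record here, ideally batching the updates
```

Plan B is where I'm stuck: is a server-side cursor like this the right way to get one record at a time, or is there a SqlSoup-level idiom for it?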

Any suggestions would be most appreciated!

Chris

  • A code sample showing the basic problem would allow someone to make a concrete suggestion. Commented Apr 28, 2012 at 1:52
  • This is very vague. Why do you want to perform "row at a time" processing (iterating)? Is your data actually a graph with records "pointing" to multiple other records without any grouping or nesting? And: 10^7 records is not big for a database. Commented Apr 28, 2012 at 11:49

1 Answer


After talking with some folks, it's pretty clear the better answer is to use Pig to process and aggregate my data locally. At the scale I'm operating at, it wasn't clear Hadoop was the appropriate tool to be reaching for. One person I talked to about this suggested Pig would be orders of magnitude faster than in-DB operations at that scale, which is about 10^7 records.
