
I am using csv.DictReader to read some large files into memory for analysis, so all objects from multiple CSV files need to be kept in memory. I need to read them as dictionaries to make analysis easier, and because the CSV files may be altered by adding new columns.

Yes SQL can be used, but I'd rather avoid it if it's not needed.

I'm wondering if there is a better and easier way of doing this. My concern is that I will have many dictionary objects with the same keys, wasting memory. Using __slots__ was an option, but I will only know the attributes of an object after reading the CSV.

[Edit:] Due to being on a legacy system with "restrictions", the use of third-party libraries is not possible.

5 Answers


If you are on Python 2.6 or later, collections.namedtuple is what you are asking for.

See http://docs.python.org/library/collections.html#collections.namedtuple (there is even an example of using it with csv).

EDIT: It requires the field names to be valid as Python identifiers, so perhaps it is not suitable in your case.
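A minimal sketch of the namedtuple approach with a CSV, using rename=True to handle field names that are not valid identifiers (the sample data here is illustrative, not from the question):

```python
import csv
from collections import namedtuple
from io import StringIO

# Hypothetical sample standing in for one of the CSV files;
# note "def" is a Python keyword and not a valid identifier.
sample = StringIO("name,value,def\nfoo,1,x\nbar,2,y\n")

reader = csv.reader(sample)
# rename=True replaces invalid field names with positional
# names such as _2, as described in the comment below.
Row = namedtuple('Row', next(reader), rename=True)
rows = [Row._make(r) for r in reader]

print(Row._fields)    # ('name', 'value', '_2')
print(rows[0].name)   # foo
```

Each row is then a lightweight tuple with named attribute access, instead of a full dict per row.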


1 Comment

In 2.7: If rename is true, invalid fieldnames are automatically replaced with positional names. For example, ['abc', 'def', 'ghi', 'abc'] is converted to ['abc', '_1', 'ghi', '_3'], eliminating the keyword def and the duplicate fieldname abc.

Have you considered using pandas?

It works very well for tables. Relevant for you are the read_csv function and the DataFrame type.

This is how you would use it:

>>> import pandas
>>> table = pandas.read_csv('a.csv')
>>> table
   a  b  c   
0  1  2  a   
1  2  4  b   
2  5  6  word
>>> table.a
0    1
1    2
2    5
Name: a

2 Comments

I've been working with pandas recently. It is an excellent toolkit for this sort of problem.
Looks good, but I'm unable to use third-party libs. And my usage is such that each row is one entity.

Use Python's shelve. It is a dictionary-like object that can be dumped to disk when required and loaded back very easily.
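A minimal sketch of what that looks like, with an illustrative path and made-up row data (Python 3; shelve.open supports the with statement in 3.4+):

```python
import os
import shelve
import tempfile

# Illustrative on-disk location for the shelf.
path = os.path.join(tempfile.mkdtemp(), 'rows')

# Store per-row dicts in the shelf so they need not all sit in RAM.
with shelve.open(path) as db:
    db['row0'] = {'a': 1, 'b': 2}

# Reload later, e.g. in the analysis phase.
with shelve.open(path) as db:
    print(db['row0'])   # {'a': 1, 'b': 2}
```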

1 Comment

I don't think it's relevant in this case.

If all the data in one column are the same type, you can use NumPy. NumPy's loadtxt and genfromtxt functions can be used to read CSV files. And because they return an array, the memory usage is smaller than with dicts.

1 Comment

Thanks for the mention, but cannot use libs. :( I edited my question.

Possibilities:

(1) Benchmark the csv.DictReader approach and see if it causes a problem. Note that the dicts contain POINTERS to the keys and values; the actual key strings are not copied into each dict.
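To illustrate that point: the key strings are shared between the row dicts, which a quick identity check confirms (the column names and data here are made up):

```python
import csv
from io import StringIO

# Illustrative two-row CSV.
data = StringIO("col_a,col_b\n1,2\n3,4\n")
rows = list(csv.DictReader(data))

# Every row dict references the SAME key string objects from
# reader.fieldnames, so keys are not copied per row.
keys0 = list(rows[0])
keys1 = list(rows[1])
print(keys0[0] is keys1[0])   # True
```

So the per-row overhead is the dict structure itself, not duplicated key strings; benchmark before assuming it is a problem.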

(2) For each file, use csv.reader; after the first row, build a class dynamically and instantiate it once per remaining row. Perhaps this is what you had in mind.

(3) Have one fixed class, instantiated once per file, which gives you a list of tuples for the actual data, a tuple that maps column indices to column names, and a dict that maps column names to column indices. Tuples occupy less memory than lists because there is no extra append-space allocated. You can then get and set your data via (row_index, column_index) and (row_index, column_name).
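A rough sketch of option (3); the class name and sample data are mine, not from the question:

```python
import csv
from io import StringIO

class Table(object):
    """One instance per file: rows as tuples, plus maps between
    column names and column indices."""
    def __init__(self, csvfile):
        reader = csv.reader(csvfile)
        self.colnames = tuple(next(reader))          # index -> name
        self.colindex = {name: i                     # name -> index
                         for i, name in enumerate(self.colnames)}
        self.rows = [tuple(r) for r in reader]       # the actual data

    def get(self, row_index, column_name):
        return self.rows[row_index][self.colindex[column_name]]

# Illustrative data; real code would open the actual CSV files.
t = Table(StringIO("a,b,c\n1,2,x\n5,6,y\n"))
print(t.get(1, 'b'))   # 6
```

The column-name machinery exists once per file instead of once per row, which is where the savings over per-row dicts come from.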

In any case, to get better advice, how about some simple facts and stats: What version of Python? How many files? rows per file? columns per file? total unique keys/column names?

