
I am using csv.DictReader to read some large files into memory for analysis, so all objects from multiple CSV files need to be kept in memory. I need to read them as dictionaries to make analysis easier, and because the CSV files may be altered by adding new columns.

Yes SQL can be used, but I'd rather avoid it if it's not needed.

I'm wondering if there is a better and easier way of doing this. My concern is that I will have many dictionary objects with the same keys, wasting memory. Using __slots__ was an option, but I will only know the attributes of an object after reading the CSV.

[Edit:] Due to being on a legacy system with "restrictions", the use of third-party libraries is not possible.

5 Answers


If you are on Python 2.6 or later, collections.namedtuple is what you are asking for.

See http://docs.python.org/library/collections.html#collections.namedtuple (there is even an example of using it with csv).

EDIT: It requires the field names to be valid as Python identifiers, so perhaps it is not suitable in your case.
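A minimal sketch of the namedtuple approach with a CSV, using rename=True to handle field names that are not valid identifiers (the sample data here is illustrative, not from the question):

```python
import csv
from collections import namedtuple
from io import StringIO

# Hypothetical sample standing in for one of the CSV files;
# note "def" is a Python keyword and not a valid identifier.
sample = StringIO("name,value,def\nfoo,1,x\nbar,2,y\n")

reader = csv.reader(sample)
# rename=True replaces invalid field names with positional
# names such as _2, as described in the comment below.
Row = namedtuple('Row', next(reader), rename=True)
rows = [Row._make(r) for r in reader]

print(Row._fields)    # ('name', 'value', '_2')
print(rows[0].name)   # foo
```

Each row is then a lightweight tuple with named attribute access, instead of a full dict per row.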


1 Comment

In 2.7: If rename is true, invalid fieldnames are automatically replaced with positional names. For example, ['abc', 'def', 'ghi', 'abc'] is converted to ['abc', '_1', 'ghi', '_3'], eliminating the keyword def and the duplicate fieldname abc.

Have you considered using pandas?

It works very well for tables. Relevant for you are the read_csv function and the DataFrame type.

This is how you would use it:

>>> import pandas
>>> table = pandas.read_csv('a.csv')
>>> table
   a  b  c   
0  1  2  a   
1  2  4  b   
2  5  6  word
>>> table.a
0    1
1    2
2    5
Name: a

2 Comments

I've been working with pandas recently. It is an excellent toolkit for this sort of problem.
Looks good, but I'm unable to use third-party libs. And my usage is such that each row is one entity.

Use Python's shelve. It is a dictionary-like object that can be dumped to disk when required and loaded back very easily.
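A minimal sketch of what that looks like, with an illustrative path and made-up row data (Python 3; shelve.open supports the with statement in 3.4+):

```python
import os
import shelve
import tempfile

# Illustrative on-disk location for the shelf.
path = os.path.join(tempfile.mkdtemp(), 'rows')

# Store per-row dicts in the shelf so they need not all sit in RAM.
with shelve.open(path) as db:
    db['row0'] = {'a': 1, 'b': 2}

# Reload later, e.g. in the analysis phase.
with shelve.open(path) as db:
    print(db['row0'])   # {'a': 1, 'b': 2}
```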

1 Comment

I don't think it's relevant in this case.

If all the data in one column are the same type, you can use NumPy. NumPy's loadtxt and genfromtxt functions can be used to read CSV files. And because they return an array, the memory usage is smaller than with dicts.

1 Comment

Thanks for the mention, but cannot use libs. :( I edited my question.

Possibilities:

(1) Benchmark the csv.DictReader approach and see if it causes a problem. Note that the dicts contain POINTERS to the keys and values; the actual key strings are not copied into each dict.
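To illustrate that point: the key strings are shared between the row dicts, which a quick identity check confirms (the column names and data here are made up):

```python
import csv
from io import StringIO

# Illustrative two-row CSV.
data = StringIO("col_a,col_b\n1,2\n3,4\n")
rows = list(csv.DictReader(data))

# Every row dict references the SAME key string objects from
# reader.fieldnames, so keys are not copied per row.
keys0 = list(rows[0])
keys1 = list(rows[1])
print(keys0[0] is keys1[0])   # True
```

So the per-row overhead is the dict structure itself, not duplicated key strings; benchmark before assuming it is a problem.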

(2) For each file, use csv.reader; after the first row, build a class dynamically and instantiate it once per remaining row. Perhaps this is what you had in mind.

(3) Have one fixed class, instantiated once per file, which gives you a list of tuples for the actual data, a tuple that maps column indices to column names, and a dict that maps column names to column indices. Tuples occupy less memory than lists because there is no extra append-space allocated. You can then get and set your data via (row_index, column_index) and (row_index, column_name).
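A rough sketch of option (3); the class name and sample data are mine, not from the question:

```python
import csv
from io import StringIO

class Table(object):
    """One instance per file: rows as tuples, plus maps between
    column names and column indices."""
    def __init__(self, csvfile):
        reader = csv.reader(csvfile)
        self.colnames = tuple(next(reader))          # index -> name
        self.colindex = {name: i                     # name -> index
                         for i, name in enumerate(self.colnames)}
        self.rows = [tuple(r) for r in reader]       # the actual data

    def get(self, row_index, column_name):
        return self.rows[row_index][self.colindex[column_name]]

# Illustrative data; real code would open the actual CSV files.
t = Table(StringIO("a,b,c\n1,2,x\n5,6,y\n"))
print(t.get(1, 'b'))   # 6
```

The column-name machinery exists once per file instead of once per row, which is where the savings over per-row dicts come from.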

In any case, to get better advice, how about some simple facts and stats: What version of Python? How many files? rows per file? columns per file? total unique keys/column names?

