Now that pandas provides a data frame structure, is there any need for structured/record arrays in numpy? There are some modifications I need to make to an existing code which requires this structured array type framework, but I am considering using pandas in its place from this point forward. Will I at any point find that I need some functionality of structured/record arrays that pandas does not provide?
3 Answers
pandas's DataFrame is a high level tool while structured arrays are a very low-level tool, enabling you to interpret a binary blob of data as a table-like structure. One thing that is hard to do in pandas is nested data types with the same semantics as structured arrays, though this can be imitated with hierarchical indexing (structured arrays can't do most things you can do with hierarchical indexing).
Structured arrays are also amenable to working with massive tabular data sets loaded via memory maps (np.memmap). This is a limitation that will be addressed in pandas eventually, though.
Comments
I'm currently in the middle of transition to Pandas DataFrames from the various Numpy arrays. This has been relatively painless since Pandas, AFAIK, is built largely on top of Numpy. What I mean by that is that .mean(), .sum() etc all work as you would hope. On top of that, the ability to add a hierarchical index and use the .ix[] (index) attribute and .xs() (cross-section) method to pull out arbitray pieces of the data has greatly improved the readability and performance of my code (mainly by reducing the number of round-trips to my database).
One thing I haven't fully investigated yet is Pandas compatibility with the more advanced functionality of Scipy and Matplotlib. However, in case of any issues, it's easy enough to pull out a single column that behaves enough like an array for those libraries to work, or even convert to an array on the fly. A DataFrame's plotting methods, for instance, rely on matplotlib and take care of any conversion for you.
Also, if you're like me and your main use of Scipy is the statistics module, pystatsmodels is quickly maturing and relies heavily on pandas.
That's my two cents' worth
Comments
I never took the time to dig into pandas, but I use structured array quite often in numpy. Here are a few considerations:
structured arrays are as convenient as
recarrayswith less overhead, if you don't mind losing the possibility to access fields by attributes. But then, have you ever tried to useminormaxas field name in arecarray?NumPy has been developed over a far longer period than
pandas, with a larger crew, and it becomes ubiquitous enough that a lot of third party packages rely on it. You can expect structured arrays to be more portable thanpandasdataframes.Are
pandasdataframes easily pickable ? Can they be sent back and forth withPyTables, for example ?
Unless you're 100% percent that you'll never have to share your code with non-pandas users, you might want to keep some structured arrays around.