I am exposing a C++ class to Python using pybind11.
Its constructor takes a numpy.array and stores a pointer to the array's internal data. (It does not copy the data.)
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <iostream>

namespace py = pybind11;

struct Data
{
    Data(const py::array_t<double, py::array::c_style | py::array::forcecast>& arr)
        : p(arr.data())  // non-owning pointer into arr's buffer
    {
        std::cout << "arr=" << p << std::endl;
        std::cout << "[0]=" << p[0] << std::endl;
    }

    const double* p;
};
I have another class which accepts a const Data&, thereby gaining access to the array data.
struct Manager
{
    Manager(const Data& data)
        : data_(data)
    {
        const double* p = data_.p;
        std::cout << "data.arr=" << p << std::endl;
        std::cout << "data.[0]=" << p[0] << std::endl;
    }

    const Data& data_;
};
Here the two classes are exposed to Python using pybind11:
PYBIND11_MODULE(foo, m)
{
    py::class_<Data>(m, "Data")
        .def(py::init<const py::array_t<double, py::array::c_style | py::array::forcecast>&>());

    py::class_<Manager>(m, "Manager")
        .def(py::init<const Data&>());
}
This is working well. I can import my module, create a Data instance from a numpy.array, and then pass that to Manager:
>>> import pandas
>>> import numpy
>>> import foo
>>> df = pandas.DataFrame(data = numpy.random.rand(990000, 7))
>>> d = foo.Data(df.values)
>>> c = foo.Manager(d)
My script works fine, and you can see my C++ code accessing the numpy.array data and printing its address and first element to stdout:
arr=0x7f47df313010
[0]=0.980507
data.arr=0x7f47df313010
data.[0]=0.980507
I created all of the above as an MCVE to illustrate the problem described below.
Now, however, I load a pandas DataFrame from a pickle file (here is a download link for the pickle file in question):
>>> import pandas
>>> import foo
>>> df = pandas.read_pickle('data5.pk')
>>> a = df.values
>>> d = foo.Data(a)
>>> c = foo.Manager(d)
and my C++ code crashes attempting to access the array data.
Here is stdout:
arr=0x7f8864241010
arr[0]=7440.7
data.arr=0x7f8864241010
<dumps core>
So the pointer to the array is the same in Manager, but attempting to dereference it causes a SEGV.
Valgrind reports "Access not within mapped region at address 0x7f8864241010" (i.e. the address of the numpy.array).
Python is perfectly happy with my pickle file:
>>> import pandas
>>> df = pandas.read_pickle('data5.pk')
>>> df.shape
(990000, 7)
>>> df
                 A             B             C            D            E  \
10000  7440.695240  15055.443905  14585.542158  3647.710616  8139.777981
10001  7440.607794  15055.356459  14585.454712  3647.623171  8139.690536
10002  7441.155761  15055.904426  14586.002679  3648.171138  8140.238503
10003  7440.430209  15055.178874  14585.277127  3647.445585  8139.512950
10004  7440.418058  15055.166724  14585.264977  3647.433435  8139.500800
10005  7440.906603  15055.655268  14585.753521  3647.921979  8139.989344
10006  7440.525167  15055.273832  14585.372085  3647.540543  8139.607908
...
I cannot for the life of me figure out what is wrong with my pickle file.
- I have tried creating a numpy.array and pickling it; that works fine
- I have tried creating a pandas.DataFrame and pickling it; that works fine
- I have sliced up my "invalid" dataframe and I can get a subset which works fine
There is something in my data which Python is happy with, but which causes a SEGV in C++.
How can I diagnose this?
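One check that may help narrow this down: the constructor parameter is declared as py::array_t<double, py::array::c_style | py::array::forcecast>, and with those flags pybind11 converts any input that is not already a C-contiguous float64 array into a new array before the constructor body runs. The sketch below compares the dtype and memory layout of the working and the failing arrays. The helpers describe_array and usable_without_copy are hypothetical names introduced here, not part of foo or pybind11, and the check rests on the assumption that the two arrays differ in layout or dtype; it does not by itself prove that this is the cause of the crash.

```python
import numpy as np


def describe_array(a):
    """Report the dtype and memory layout of an array (hypothetical helper)."""
    a = np.asarray(a)
    return {
        "dtype": str(a.dtype),
        "shape": a.shape,
        "strides": a.strides,
        "c_contiguous": bool(a.flags["C_CONTIGUOUS"]),
        "f_contiguous": bool(a.flags["F_CONTIGUOUS"]),
        "owns_data": bool(a.flags["OWNDATA"]),
    }


def usable_without_copy(a):
    """True if the array is already a C-contiguous float64 array, i.e. it
    needs no conversion to satisfy array_t<double, c_style | forcecast>."""
    a = np.asarray(a)
    return bool(a.dtype == np.float64 and a.flags["C_CONTIGUOUS"])


if __name__ == "__main__":
    c_order = np.random.rand(5, 7)          # C-ordered float64 array
    f_order = np.asfortranarray(c_order)    # same values, Fortran (column-major) order
    for name, arr in (("c_order", c_order), ("f_order", f_order)):
        print(name, describe_array(arr), usable_without_copy(arr))
```

In the sessions above this would be called as describe_array(df.values), once for the DataFrame built from numpy.random.rand and once for the one loaded from data5.pk, to see whether the two results differ.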