
I am exposing a C++ class to python using pybind11.

It takes a numpy.array in its constructor, and grabs a pointer to its internal data. (It does not copy the data).

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <iostream>

namespace py = pybind11;

struct Data
{
    Data(const py::array_t<double, py::array::c_style| py::array::forcecast>& arr)
        : p(arr.data())
    {
        std::cout << "arr=" << p    << std::endl;
        std::cout << "[0]=" << p[0] << std::endl;
    }
    const double* p;
};

I have another class which accepts a const Data&, thereby gaining access to the array data.

struct Manager
{
    Manager(const Data& data)
        : data_(data)
    {
        const double* p = data_.p;

        std::cout << "data.arr=" << p    << std::endl;
        std::cout << "data.[0]=" << p[0] << std::endl;
    }
    const Data& data_;
};

Here the two classes are exposed to python using pybind11:

PYBIND11_MODULE(foo, m)
{
    py::class_<Data>(m, "Data")
        .def(py::init<const py::array_t<double, py::array::c_style| py::array::forcecast>&>());

    py::class_<Manager>(m, "Manager")
        .def(py::init<const Data&>());
}

This is working well. I can import my module, create a Data instance from a numpy.array, and then pass that to Manager:

>>> import pandas
>>> import numpy
>>> import foo

>>> df = pandas.DataFrame(data = numpy.random.rand(990000, 7))
>>> d = foo.Data(df.values)
>>> c = foo.Manager(d)

My script works fine, and you can see my C++ code accessing the numpy.array data and printing its address and first element to stdout:

arr=0x7f47df313010
[0]=0.980507
data.arr=0x7f47df313010
data.[0]=0.980507

I put together all of the above in an attempt to create an MCVE illustrating the problem I am experiencing below.

Now, however, I load a pandas DataFrame from a pickle file I have (here is a download link for the pickle file in question):

>>> import pandas
>>> import foo

>>> df = pandas.read_pickle('data5.pk') 
>>> a = df.values
>>> d = foo.Data(a)
>>> c = foo.Manager(d)

and my C++ code crashes attempting to access the array data.

Here is stdout:

arr=0x7f8864241010
[0]=7440.7
data.arr=0x7f8864241010
<dumps core>

So the pointer to the array is the same in Manager, but attempting to dereference the pointer causes a SEGV.

Running it under valgrind reports Access not within mapped region at address 0x7f8864241010 (i.e. the address of the numpy.array data).

Python is perfectly happy with my pickle file:

>>> import pandas

>>> df = pandas.read_pickle('data5.pk')
>>> df.shape
(990000, 7) 
>>> df
                  A             B             C            D            E  \
10000   7440.695240  15055.443905  14585.542158  3647.710616  8139.777981   
10001   7440.607794  15055.356459  14585.454712  3647.623171  8139.690536   
10002   7441.155761  15055.904426  14586.002679  3648.171138  8140.238503   
10003   7440.430209  15055.178874  14585.277127  3647.445585  8139.512950   
10004   7440.418058  15055.166724  14585.264977  3647.433435  8139.500800   
10005   7440.906603  15055.655268  14585.753521  3647.921979  8139.989344   
10006   7440.525167  15055.273832  14585.372085  3647.540543  8139.607908
...

I cannot for the life of me figure out what is wrong with my pickle file.

  • I have tried creating a numpy.array and pickling, that works fine
  • I have tried creating a pandas.DataFrame and pickling, that works fine
  • I have sliced up my "invalid" dataframe and I can get a subset which works fine

There is something in my data which Python is happy with, but which causes a SEGV in C++.

How can I diagnose this?

  • Why are you blaming the pickle? Commented Jul 13, 2018 at 17:25
  • @user2357112 I'm blaming this particular pickle file. I cannot replicate the SEGV with other data (eg: numpy.random.rand etc). Commented Jul 13, 2018 at 17:26
  • In addition, my python script is exactly the same in all ways, except in one instance I create an array of random data, and in the other I read the data from a pickle file Commented Jul 13, 2018 at 17:27

1 Answer

The pickle is fine. It's your code that's wrong. You take a pointer to the array's data without doing anything to ensure that that data actually lives as long as the object that uses it.

You need to keep a reference to the array and perform the associated refcount management. pybind11 probably has some sort of mechanism to represent a Python reference and handle the refcounting for you. From a quick look at the docs, it looks like your code should probably take an array_t by value instead of by const reference (an array_t already represents a Python reference), and store it in an array_t instance variable.
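
A minimal sketch of that suggestion, reusing the Data name and binding from the question (untested and only illustrative; Manager would similarly want to hold Data by value, or its own reference, rather than a const Data& to an object it does not own):

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <utility>

namespace py = pybind11;

struct Data
{
    using array_type = py::array_t<double, py::array::c_style | py::array::forcecast>;

    // Take the array by value: an array_t is itself a Python reference, so
    // storing it as a member bumps the refcount and keeps the numpy buffer
    // (or the copy that forcecast may have produced) alive as long as Data lives.
    explicit Data(array_type arr)
        : arr_(std::move(arr))
        , p(arr_.data())
    {
    }

    array_type arr_;     // owning handle; refcount managed by pybind11
    const double* p;     // raw pointer into arr_'s buffer, valid while arr_ is held
};

PYBIND11_MODULE(foo, m)
{
    py::class_<Data>(m, "Data")
        .def(py::init<Data::array_type>());
}

With this layout the pointer is taken from an array the object itself owns, so it cannot outlive the data it points at.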


Comments

Surely the fact that I have a variable df which is keeping the DataFrame alive in python is enough to keep the array from being destroyed? Or does python read ahead and know that df is not used afterwards, so it can speculatively delete the resources?
Note that I've created multiple other pickle files from random data and subsets of the "problem" data, and they all work
@SteveLorimer: The DataFrame is an entirely different object. df.values is not guaranteed to be attached to the DataFrame in any way; for mixed-dtype DataFrames, it will be a new array.
Ah I see, ok, I'll try capture a reference to the array first
I've updated the question to show capturing df.values in a variable first, and then passing that to my code. Same result unfortunately. In addition, I would have thought that if this were the cause of the problem (the lifetime of a temporary), I would see the crash in all my other test cases, yet I cannot make it crash at all except with this particular pickle (not even with other pickle files)