Below is a snippet that converts data into a NumPy array. It is then converted to a Pandas DataFrame where I intend to process it. I'm attempting to convert it back to a NumPy array. I'm failing at this. Badly.

import pandas as pd
import numpy as np
from pprint import pprint

data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0)
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]

npArray = np.asarray(data, coordinatesType)
df = pd.DataFrame(data = npArray)

# do some pandas processing, then convert back to a numpy array

mutatedNpArray = df.to_numpy(coordinatesType)
pprint(mutatedNpArray)

# don't supply dtype, for kicks
pprint(df.to_numpy())

This yields crazytown:

array([[('2020-11-01T00:00:00', 1.6041888e+18),
        ('1970-01-01T00:00:01', 1.0000000e+00)],
       [('2020-11-02T00:00:00', 1.6042752e+18),
        ('1970-01-01T00:00:02', 2.0000000e+00)]],
      dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
array([[Timestamp('2020-11-01 00:00:00'), 1.0],
       [Timestamp('2020-11-02 00:00:00'), 2.0]], dtype=object)

I realize a DataFrame is really a fancy NumPy array under the hood, but I'm passing back to a function that accepts a simple NumPy array. Clearly I'm not handling dtypes correctly and/or I don't understand the data structure inside my DataFrame. Below is what the function I'm calling expects:

array([('2020-11-01T00:00:00', 1.),
       ('2020-11-02T00:00:00', 2.)],
      dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])

I'm really lost on how to do this. Or what I should be doing instead.

Help!


As @hpaul suggested, I tried the following:

# ...
df = df.set_index('timestamp')

# do some pandas processing, then convert back to a numpy array

mutatedNpArray = df.to_records(coordinatesType)
# ...

All good!
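
One caveat about the call above: the first positional parameter of DataFrame.to_records is index, not a dtype, so coordinatesType passed there most likely acts only as a truthy index flag, and the field dtypes come from pandas' defaults. Below is a minimal sketch that makes the target dtype explicit instead. It assumes the same data and coordinatesType as the question; the name records is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0),
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]

df = pd.DataFrame(data=np.asarray(data, dtype=coordinatesType))

# ... pandas processing would go here ...

# index=False leaves out the row index, so only the two columns become fields
# and no set_index('timestamp') step is needed.
records = df.to_records(index=False)

# to_records returns a recarray; convert it to a plain ndarray, then cast the
# fields to the target structured dtype (for example datetime64[ns] -> [s]).
mutatedNpArray = np.asarray(records).astype(coordinatesType)

print(mutatedNpArray.shape)
print(mutatedNpArray.dtype)
```

The final astype is what pins the timestamp field to second resolution; without it, the resolution is whatever pandas stored internally.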

    Look for a to_records method. Don't forget to read the docs. You may be able to specify dtype as you did originally. Commented Nov 26, 2020 at 4:31

1 Answer

Besides the to_records approach mentioned in comments, you can do:

df.apply(tuple, axis=1).to_numpy(coordinatesType)

Output:

array([('2020-11-01T00:00:00', 1.), ('2020-11-02T00:00:00', 2.)],
      dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])

Considerations:

I believe the issue here is related to the difference between the original array and the dataframe.

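The shape of your original NumPy array is (2,), where each element is a record holding a timestamp and a value. The DataFrame is two-dimensional: both df.shape and df.to_numpy().shape are (2, 2), so passing the structured dtype to to_numpy appears to apply it to every cell rather than to every row. That is why each cell in your output became its own (timestamp, value) pair. Converting each row to a tuple first gives a pd.Series with the original shape of (2,).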
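
The shape mismatch described above can be checked directly. This is a small sketch, assuming the data and coordinatesType from the question; the name rowTuples is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0),
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]

npArray = np.asarray(data, dtype=coordinatesType)
df = pd.DataFrame(data=npArray)

# The structured array is 1-D: each element is one (timestamp, value) record.
print(npArray.shape)

# The DataFrame and the plain array it converts to are 2-D: rows x columns.
print(df.shape)
print(df.to_numpy().shape)

# Packing each row into a tuple yields a 1-D Series again, one tuple per row.
rowTuples = df.apply(tuple, axis=1)
print(rowTuples.shape)
```

Comparing the printed shapes shows where the dimension is gained (building the DataFrame) and where it is dropped again (packing rows into tuples).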
