
A pandas DataFrame is limited to a fixed integer dtype (int64). NumPy arrays don't have this limitation; we can use np.int8, for example (different float sizes are also available). (Edit: as the answer below explains, this limitation no longer exists.)

Will scikit-learn generally perform better on large datasets if we first convert the DataFrame to a raw NumPy array with reduced-size dtypes (e.g. np.float64 down to np.float16)? If so, does this potential speed-up only come into play when memory is limited?

It seems that very high float precision is often unimportant in ML compared with computational cost and memory footprint.

If more context is needed, I'm considering applying ensemble learners such as RandomForestRegressor to large datasets (4-16 GB, tens of millions of records with ~10-50 features). However, I'm most interested in the general case.
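
For concreteness, here is roughly the kind of conversion I have in mind (the shapes, dtypes and sizes below are just placeholders, not my real data):

import numpy as np
import pandas as pd

# Placeholder standing in for a wide, tall feature table.
df = pd.DataFrame(np.random.rand(1_000_000, 20))

# Downcast to a raw NumPy array before handing it to scikit-learn
# (DataFrame.to_numpy needs a reasonably recent pandas).
X = df.to_numpy(dtype=np.float16)

print(df.memory_usage(deep=True).sum())  # ~160 MB at float64 (plus the index)
print(X.nbytes)                          # ~40 MB at float16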

1 Answer


The documentation for RandomForestRegressor states that the input samples will be converted to dtype=np.float32 internally.
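
So going below float32 won't change anything inside the estimator itself; at most you save memory on your own side before fit() makes its float32 copy. A minimal sketch of what I mean, using synthetic data in place of your real dataset:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a large feature matrix and target.
X = np.random.rand(100_000, 20)   # float64 by default
y = np.random.rand(100_000)

# fit() will cast the input to float32 anyway; casting it yourself up front
# avoids holding a float64 copy and a float32 copy in memory at the same time.
X32 = np.ascontiguousarray(X, dtype=np.float32)

reg = RandomForestRegressor(n_estimators=10, n_jobs=-1)
reg.fit(X32, y)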


Below is the original answer, which addresses the issue of using custom NumPy dtypes in pandas (the part of the question that is now marked as no longer applicable).

You can use NumPy dtypes in pandas. Here is an example (from a script of mine) of reading a .csv file with specified column dtypes:

import numpy as np
import pandas as pd

# Read only columns 0, 4, 5 and 10, each with an explicit (reduced-size) dtype.
df = pd.read_csv(filename, usecols=[0, 4, 5, 10],
                 dtype={0: np.uint8,
                        4: np.uint32,
                        5: np.uint16,
                        10: np.float16})
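
To confirm that the columns actually came in with the requested dtypes (and to see the memory saving), you can check something like:

print(df.dtypes)
print(df.memory_usage(deep=True))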

You can change the dtype of an existing Series or of a column in an existing DataFrame using Series.astype():

s = pd.Series(...)
s = s.astype(np.float16)

df = pd.DataFrame(...)
df['col1'] = df['col1'].astype(np.float16)

If you want to change the dtypes of several columns in a DataFrame, or even of all columns, use DataFrame.astype():

df = pd.DataFrame(...)
df[['col1', 'col2']] = df[['col1', 'col2']].astype(np.float16)
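
If the columns need different dtypes, DataFrame.astype() also accepts a dict mapping column names to dtypes (on a reasonably recent pandas version):

df = df.astype({'col1': np.float16, 'col2': np.uint8})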

Comments

I thought I saw some limitation regarding mixed types in pandas (even if only mixed across different Series). I'll see if I can find specifics.
OK, I think this limitation is specific to reading data in, i.e. read_csv. I'll give this a try and see if it works. Assuming it does, do you expect a performance boost?
What limitation? You can pass a dtype argument to read_csv, as I showed above. As for a performance boost, I don't know: profile your code. What you will definitely see is reduced memory usage, which for the large datasets you want to work with could be important.
I found this example that outlines my issue, showing an attempt at reduced precision (int16) resulting in int64. Maybe this limitation is specific to read_csv and can be overcome with astype. I'll try to report back soon. Also, I'm still curious whether, in general, performance is improved by less precise datatypes.
That issue was solved by this pull request: github.com/pydata/pandas/pull/2708 (so since v0.11).