
A pandas DataFrame is limited to a fixed integer dtype (int64). NumPy arrays don't have this limitation; we can use np.int8, for example (different float sizes are also available). (Edit: as the answer below explains, this limitation no longer exists.)

Will scikit-learn generally perform better on large datasets if we first convert the DataFrame to a raw NumPy array with reduced-size dtypes (e.g. np.float64 down to np.float16)? If so, does this potential speed-up only come into play when memory is limited?

It seems that very high float precision is often unimportant in ML compared with computational cost and memory footprint.

If more context is needed, I'm considering applying ensemble learners such as RandomForestRegressor to large datasets (4-16 GB, tens of millions of records with ~10-50 features). However, I'm most interested in the general case.
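
For concreteness, here is roughly the kind of conversion I have in mind (the shapes, dtypes and sizes below are just placeholders, not my real data):

import numpy as np
import pandas as pd

# Placeholder standing in for a wide, tall feature table.
df = pd.DataFrame(np.random.rand(1_000_000, 20))

# Downcast to a raw NumPy array before handing it to scikit-learn
# (DataFrame.to_numpy needs a reasonably recent pandas).
X = df.to_numpy(dtype=np.float16)

print(df.memory_usage(deep=True).sum())  # ~160 MB at float64 (plus the index)
print(X.nbytes)                          # ~40 MB at float16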

1 Answer


The documentation for RandomForestRegressor states that the input samples will be converted to dtype=np.float32 internally.
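
So going below float32 won't change anything inside the estimator itself; at most you save memory on your own side before fit() makes its float32 copy. A minimal sketch of what I mean, using synthetic data in place of your real dataset:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a large feature matrix and target.
X = np.random.rand(100_000, 20)   # float64 by default
y = np.random.rand(100_000)

# fit() will cast the input to float32 anyway; casting it yourself up front
# avoids holding a float64 copy and a float32 copy in memory at the same time.
X32 = np.ascontiguousarray(X, dtype=np.float32)

reg = RandomForestRegressor(n_estimators=10, n_jobs=-1)
reg.fit(X32, y)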


Below is the original answer, which addresses the issue of using custom NumPy dtypes in pandas (the part of the question that is now marked as no longer applicable).

You can use NumPy dtypes in pandas. Here is an example (from a script of mine) of reading a .csv file with specified column dtypes:

import numpy as np
import pandas as pd

# Read only columns 0, 4, 5 and 10, each with an explicit (reduced-size) dtype.
df = pd.read_csv(filename, usecols=[0, 4, 5, 10],
                 dtype={0: np.uint8,
                        4: np.uint32,
                        5: np.uint16,
                        10: np.float16})
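
To confirm that the columns actually came in with the requested dtypes (and to see the memory saving), you can check something like:

print(df.dtypes)
print(df.memory_usage(deep=True))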

You can change the dtype of an existing Series or of a column in an existing DataFrame using Series.astype():

s = pd.Series(...)
s = s.astype(np.float16)

df = pd.DataFrame(...)
df['col1'] = df['col1'].astype(np.float16)

If you want to change the dtypes of several columns in a DataFrame, or even of all columns, use DataFrame.astype():

df = pd.DataFrame(...)
df[['col1', 'col2']] = df[['col1', 'col2']].astype(np.float16)
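
If the columns need different dtypes, DataFrame.astype() also accepts a dict mapping column names to dtypes (on a reasonably recent pandas version):

df = df.astype({'col1': np.float16, 'col2': np.uint8})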

Comments

I thought I saw some limitation regarding mixed types in pandas (even if only mixed across different Series). I'll see if I can find specifics.
OK, I think this limitation is specific to reading data in, i.e. read_csv. I'll give this a try and see if it works. Assuming it does, do you expect a performance boost?
What limitation? You can pass a dtype argument to read_csv, as I showed above. As for a performance boost, I don't know: profile your code. What you will definitely see is reduced memory usage, which for the large datasets you want to work with could be important.
I found this example that outlines my issue, showing an attempt at reduced precision (int16) resulting in int64. Maybe this limitation is specific to read_csv and can be overcome with astype. I'll try to report back soon. Also, I'm still curious whether, in general, performance is improved by less precise datatypes.
That issue was solved by this pull request: github.com/pydata/pandas/pull/2708 (so since v0.11).