
This post provides an elegant way to create an empty pandas DataFrame of a specified data type. And if you specify np.nan values when you initialize it, the data type is set to float:

df_training_outputs = pd.DataFrame(np.nan, index=index, columns=column_names)

But I want to create an empty DataFrame with different data types in each column. It seems the dtype keyword argument will only accept one.

Background: I am writing a script that generates data incrementally, so I need somewhere to store it during execution. I thought an empty data frame (large enough to hold all the expected data) would be the best way to do this. This must be a fairly common task, so if someone has a better way, please advise.

  • Maybe it would be effective to use a separate Series for each column and concatenate them when a DataFrame is needed? Commented May 23, 2016 at 11:52
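A minimal sketch of this suggestion, using hypothetical column names and data: accumulate values in plain Python lists, build one typed Series per column, and concatenate only once at the end.

```python
import pandas as pd

# Hypothetical columns: accumulate incrementally generated values
# in plain lists, one per column.
ids, names, scores = [], [], []
for i in range(3):
    ids.append(i)
    names.append("row%d" % i)
    scores.append(i * 0.5)

# Build typed Series and concatenate them into a DataFrame once.
df = pd.concat(
    [
        pd.Series(ids, name="id", dtype="uint32"),
        pd.Series(names, name="name", dtype="object"),
        pd.Series(scores, name="score", dtype="float64"),
    ],
    axis=1,
)
```

Each column keeps its own dtype (`uint32`, `object`, `float64`), since no mixed-type preallocation ever happens.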

1 Answer


One way to create an empty DataFrame with columns of different types is to provide an empty numpy array with an appropriate structured dtype:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.empty(0, dtype=[('a', 'u4'), ('b', 'S20'), ('c', 'f8')]))

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 3 columns):
a    0 non-null uint32
b    0 non-null object
c    0 non-null float64
dtypes: float64(1), object(1), uint32(1)
memory usage: 76.0+ bytes

7 Comments

Thanks. This works. However, perhaps not surprisingly, I noticed a huge speed difference in doing it this way (3.23 s to complete) compared to the earlier method above (168 ms) where the dataframe was created entirely of the same data type (float). So in my case I think it's better to first fill the dataframe with floats then convert the desired columns to integers at the end.
To clarify: by speed difference, I mean the time it takes to fill the resulting dataframe with values using setter methods such as df.at[] = ...
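The workaround described above, sketched with hypothetical column names and sizes: preallocate a homogeneous float frame (which pandas can back with a single 2-D block), fill it with `df.at[]`, then convert the integer-valued columns once at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical size and column names.
index = range(3)
df = pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"])

# Fill cell by cell; column "a" is integer-valued but stored
# as float for now so the frame stays homogeneous.
for i in index:
    df.at[i, "a"] = i
    df.at[i, "b"] = i * 0.5
    df.at[i, "c"] = i * 2.0

# Convert the desired column to an integer dtype in one step.
df["a"] = df["a"].astype("uint32")
```

The conversion must happen after the column is fully populated, since `uint32` cannot hold the NaN placeholders.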
@Bill there's nothing surprising here, really: for homogeneous arrays, pandas may use a single 2-D container as a backend.
@Bill You could also try just using raw numpy record array, and then convert it to a dataframe at the very end, this way it's zero-copy and could be faster even than the homogeneous dataframe approach.
Thanks @aldanor. I realize now a dataframe wasn't the right approach for capturing this data. I need to build the data in separate but fast and efficient data objects such as pandas.Series or numpy arrays and then combine them at the end into a data frame.
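The record-array approach suggested in this thread, sketched with hypothetical fields: fill a preallocated numpy structured array in place, and wrap it in a DataFrame only once at the very end.

```python
import numpy as np
import pandas as pd

# Preallocate a structured array with the target per-field dtypes.
n = 3
arr = np.empty(n, dtype=[("a", "u4"), ("b", "S20"), ("c", "f8")])

# Fill records in place during the incremental computation.
for i in range(n):
    arr[i] = (i, b"row%d" % i, i * 0.5)

# Convert to a DataFrame once, at the end.
df = pd.DataFrame(arr)
```

Each DataFrame column inherits its dtype from the corresponding structured-array field.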
