
This post provides an elegant way to create an empty pandas DataFrame of a specified data type. And if you specify np.nan values when you initialize it, the data type is set to float:

df_training_outputs = pd.DataFrame(np.nan, index=index, columns=column_names)

But I want to create an empty DataFrame with different data types in each column. It seems the dtype keyword argument will only accept one.

Background: I am writing a script that generates data incrementally, so I need somewhere to store it during execution. I thought an empty data frame (large enough to hold all the expected data) would be the best way to do this. This must be a fairly common task, so if someone has a better way, please advise.

  • Maybe it would be effective to use a separate Series for each column and concatenate them when a DataFrame is needed? Commented May 23, 2016 at 11:52
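A minimal sketch of this suggestion, using hypothetical column names and data: accumulate values in plain Python lists, build one typed Series per column, and concatenate only once at the end.

```python
import pandas as pd

# Hypothetical columns: accumulate incrementally generated values
# in plain lists, one per column.
ids, names, scores = [], [], []
for i in range(3):
    ids.append(i)
    names.append("row%d" % i)
    scores.append(i * 0.5)

# Build typed Series and concatenate them into a DataFrame once.
df = pd.concat(
    [
        pd.Series(ids, name="id", dtype="uint32"),
        pd.Series(names, name="name", dtype="object"),
        pd.Series(scores, name="score", dtype="float64"),
    ],
    axis=1,
)
```

Each column keeps its own dtype (`uint32`, `object`, `float64`), since no mixed-type preallocation ever happens.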

1 Answer


One way to create an empty DataFrame with columns of different types is to provide an empty numpy array with an appropriate structured dtype:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.empty(0, dtype=[('a', 'u4'), ('b', 'S20'), ('c', 'f8')]))

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 3 columns):
a    0 non-null uint32
b    0 non-null object
c    0 non-null float64
dtypes: float64(1), object(1), uint32(1)
memory usage: 76.0+ bytes

7 Comments

Thanks. This works. However, perhaps not surprisingly, I noticed a huge speed difference in doing it this way (3.23 s to complete) compared to the earlier method above (168 ms) where the dataframe was created entirely of the same data type (float). So in my case I think it's better to first fill the dataframe with floats then convert the desired columns to integers at the end.
To clarify: by speed difference, I mean the time it takes to fill the resulting dataframe with values using setter methods such as df.at[] = ...
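The workaround described above, sketched with hypothetical column names and sizes: preallocate a homogeneous float frame (which pandas can back with a single 2-D block), fill it with `df.at[]`, then convert the integer-valued columns once at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical size and column names.
index = range(3)
df = pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"])

# Fill cell by cell; column "a" is integer-valued but stored
# as float for now so the frame stays homogeneous.
for i in index:
    df.at[i, "a"] = i
    df.at[i, "b"] = i * 0.5
    df.at[i, "c"] = i * 2.0

# Convert the desired column to an integer dtype in one step.
df["a"] = df["a"].astype("uint32")
```

The conversion must happen after the column is fully populated, since `uint32` cannot hold the NaN placeholders.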
@Bill there's nothing surprising here, really: for homogeneous arrays, pandas may use a single 2-D container as a backend.
@Bill You could also try just using raw numpy record array, and then convert it to a dataframe at the very end, this way it's zero-copy and could be faster even than the homogeneous dataframe approach.
Thanks @aldanor. I realize now a dataframe wasn't the right approach for capturing this data. I need to build the data in separate but fast and efficient data objects such as pandas.Series or numpy arrays and then combine them at the end into a data frame.
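The record-array approach suggested in this thread, sketched with hypothetical fields: fill a preallocated numpy structured array in place, and wrap it in a DataFrame only once at the very end.

```python
import numpy as np
import pandas as pd

# Preallocate a structured array with the target per-field dtypes.
n = 3
arr = np.empty(n, dtype=[("a", "u4"), ("b", "S20"), ("c", "f8")])

# Fill records in place during the incremental computation.
for i in range(n):
    arr[i] = (i, b"row%d" % i, i * 0.5)

# Convert to a DataFrame once, at the end.
df = pd.DataFrame(arr)
```

Each DataFrame column inherits its dtype from the corresponding structured-array field.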
