Python pandas: construct a list of dataclass objects from each row of a dataframe

Question

A consistent answer seems to be to avoid iterating over rows while working with Pandas. I'd like to understand how I can do so in the following case.

from dataclasses import dataclass
from typing import List

import pandas as pd

@dataclass
class Person:
    id: int
    name: str
    age: int

persons_df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

persons_list: List[Person] = []  # populate this list with Person objects, created from the dataframe above

# my approach is to use itertuples()
for row in persons_df.itertuples():
    person = Person(row.id, row.name, int(row.age))  # type: ignore
    persons_list.append(person)

I'd like to find an option that avoids iterating over the rows and, if possible, has some type safety built in (so that I can drop the mypy ignore comment).

thanks!

Andreas · Accepted Answer · 2021-05-04 22:08:59Z

4

I am not sure if that's what you are looking for, but maybe this helps:

import pandas as pd
df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

class Person:
    def __init__(self, lst):
        self.id = lst[0]
        self.name = lst[1]
        self.age = lst[2]

df.apply(Person, axis=1).tolist()

out:

[<__main__.Person at 0x176eee70608>,
 <__main__.Person at 0x176eee704c8>,
 <__main__.Person at 0x176eee70388>]


2 Comments

anerjee Over a year ago

Thanks! This option is about 10x slower than itertuples - so it won't scale as well.

Andreas Over a year ago

@anerjee, thanks for the feedback. Sorry, I wasn't aware that your goal was to optimize speed; Series.map and DataFrame.apply are not the best choice for that. I thought you wanted to skip the for loop and avoid the ignore comment.
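
To check the speed difference mentioned in these comments on your own data, a minimal timing sketch could look like the following. The row count, the synthetic columns, and the helper names `with_itertuples` and `with_apply` are assumptions made for illustration; the actual ratio will depend on the pandas version and the shape of the dataframe.

```python
import timeit
from dataclasses import dataclass

import pandas as pd


@dataclass
class Person:
    id: int
    name: str
    age: int


# Synthetic data; 2_000 rows is an arbitrary size chosen for illustration.
n = 2_000
df = pd.DataFrame({"id": range(n), "name": ["x"] * n, "age": [30] * n})


def with_itertuples():
    # index=False drops the index, so each tuple holds exactly id, name, age
    return [Person(*row) for row in df.itertuples(index=False)]


def with_apply():
    # Row-wise apply builds a Series for every row before calling Person
    return df.apply(lambda row: Person(*row), axis=1).tolist()


# Both approaches should produce equal lists of Person objects
assert with_itertuples() == with_apply()

for fn in (with_itertuples, with_apply):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs")
```

No expected timings are given here on purpose; run it locally to see the numbers for your setup.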

amanin · Accepted Answer · 2023-02-01 15:06:19Z

I'm adding a new answer because the title of the question is about mapping dataframe rows to a list of dataclass objects, and this has not been addressed yet.

To return dataclasses, we can slightly improve @Andreas' answer so that it no longer requires an additional constructor receiving a list. We just have to use Python's unpacking operators (`*` and `**`).

I see two ways of mapping:

1. The dataframe column names match the dataclass field names. In this case, we can pass each row as a set of keyword arguments: df.apply(lambda row: MyDataClass(**row), axis=1)
2. The dataframe column names do not match the dataclass field names, but the column order matches the dataclass field order. In this case, we can pass the row values as ordered positional arguments: df.apply(lambda row: MyDataClass(*row), axis=1)

Example:

Define same data class and same dataframe as in the question:

from dataclasses import dataclass

@dataclass
class Person:
    id: int
    name: str
    age: int

import pandas

df = pandas.DataFrame(data={
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'], 
    'age': [32, 44, '86']
})

Conversion based on column order:

persons = df.apply(lambda row: Person(*row), axis=1)

Conversion based on column names (column order is shuffled for a better test):

persons = df[['age', 'id', 'name']].apply(lambda row: Person(**row), axis=1)

Now we can verify the result. In both cases above, this snippet:
```
print(type(persons))
print(persons)
```

prints:

<class 'pandas.core.series.Series'>
0      Person(id=1, name='A', age=32)
1      Person(id=2, name='B', age=44)
2    Person(id=3, name='C', age='86')
dtype: object

WARNINGS:

- I have not measured the performance of this solution.
- This does not enforce any type checking (look at the last person printed: its age is a string). As Python does not enforce type annotations at runtime, this quick solution does not bring any additional safety.
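
One possible way to address the second warning is to coerce values at construction time. The sketch below is an assumption layered on top of this answer, not part of it: it adds a `__post_init__` that converts each value to its annotated type, which only works for simple callable annotations such as `int` or `str`. It is a runtime check and does not by itself give mypy any extra information.

```python
from dataclasses import dataclass, fields

import pandas as pd


@dataclass
class Person:
    id: int
    name: str
    age: int

    def __post_init__(self):
        # Coerce each value to its annotated type; int("abc") would raise ValueError.
        # Assumes every annotation is a plain class (no Optional, List[...], etc.).
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                setattr(self, f.name, f.type(value))


df = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"], "age": [32, 44, "86"]})

# to_dict(orient="records") yields one {column: value} dict per row
persons = [Person(**record) for record in df.to_dict(orient="records")]
print(persons[2])  # Person(id=3, name='C', age=86)
```

With this variant the last person's age is stored as the integer 86 instead of the string '86'.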

kowpow · Accepted Answer · 2023-06-07 21:25:38Z

1

One additional option would be to iterate through a NumPy array with a generator.

So for example given:

from dataclasses import dataclass

@dataclass
class Person:
    id: int
    name: str
    age: int

import pandas

df = pandas.DataFrame(data={
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'], 
    'age': [32, 44, '86']
})

you can run:

persons = list(Person(*row) for row in df.values)
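
A closely related variant, sketched here as an assumption rather than as part of this answer, is to iterate with `itertuples(index=False, name=None)` after selecting the columns in dataclass field order. That way the positional unpacking no longer depends on how the dataframe columns happen to be ordered. The `columns` variable is a name introduced for this example.

```python
from dataclasses import dataclass, fields

import pandas as pd


@dataclass
class Person:
    id: int
    name: str
    age: int


# Same data as in the question, with the columns deliberately shuffled
df = pd.DataFrame({
    "age": [32, 44, "86"],
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
})

# Reorder the columns to match the dataclass field order: id, name, age
columns = [f.name for f in fields(Person)]

# name=None makes itertuples yield plain tuples; index=False omits the index
persons = [Person(*row) for row in df[columns].itertuples(index=False, name=None)]
print(persons)
```

Note that, like the original one-liner, this performs no type conversion, so the last age is still the string '86'.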


Дмитрий Песков · Accepted Answer · 2025-03-20 07:24:21Z

Please don't slap my hands for this code ;)

I added dynamic data class generation based on information from the dataset itself.

In other words, you don't have to declare the data class yourself.

Take a look at the code below

from dataclasses import make_dataclass
from typing import Optional, Any
import pandas

def dataframe_to_dataclasses(df: pandas.DataFrame, class_name: str) -> list[Any]:
    # make a list of fields for the future data class
    fields = []
    for column_name in df.columns:
        original_column_type = df[column_name].dtype
        not_null = all(pandas.notnull(df[column_name]))
        column_type = (
            original_column_type if not_null else Optional[original_column_type]
        )
        field = (column_name, column_type)
        fields.append(field)
    # make dataclass
    dclass = make_dataclass(cls_name=class_name, fields=fields)
    # make a list of instances dataclasses
    instances = []
    for _, row in df.iterrows():
        i = dclass(*row)
        instances.append(i)
    return instances

Be careful! As you can see, there are no checks for compliance with the naming rules for class attributes. If a column name starts with a digit, consists of several words, or is otherwise not a valid Python identifier, we will get an exception. Perhaps someone will like this approach and decide to refine or improve it.

Usage example:

from uuid import uuid4
from datetime import date, datetime
import pandas

data = {
    "id": [1, 2, 3],
    "date": [date(2025, 4, 12), date(2024, 3, 2), date(2023, 4, 18)],
    "moment": [
        datetime(2025, 4, 12, 23, 12),
        datetime(2024, 3, 2, 17, 41),
        datetime(2023, 4, 18, 11, 32),
    ],
    "label": [uuid4(), uuid4(), uuid4()],
    "description": ["Первый", None, "Третий"],
    "price": [231.73, 532.89, 50.7],
}
table = pandas.DataFrame(data)
instances = dataframe_to_dataclasses(table, "Test")
for i in instances:
    print(
        "class: ",
        i,
        "attribute_types:",
        type(i.id),
        type(i.date),
        type(i.moment),
        type(i.label),
        type(i.description),
        type(i.price),
    )

The code above prints output like the following to the console (the UUIDs are random and will differ on each run):

class:  Test(id=1, date=datetime.date(2025, 4, 12), moment=Timestamp('2025-04-12 23:12:00'), label=UUID('5ad3582b-91f3-48b7-b904-9223ea867402'), description='Первый', price=231.73) attribute_types: <class 'int'> <class 'datetime.date'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'uuid.UUID'> <class 'str'> <class 'float'>
class:  Test(id=2, date=datetime.date(2024, 3, 2), moment=Timestamp('2024-03-02 17:41:00'), label=UUID('1be89b91-c940-42a7-8248-973ef99fd98d'), description=None, price=532.89) attribute_types: <class 'int'> <class 'datetime.date'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'uuid.UUID'> <class 'NoneType'> <class 'float'>
class:  Test(id=3, date=datetime.date(2023, 4, 18), moment=Timestamp('2023-04-18 11:32:00'), label=UUID('90451cf8-5d95-4bd4-8166-7c0d92b26990'), description='Третий', price=50.7) attribute_types: <class 'int'> <class 'datetime.date'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'uuid.UUID'> <class 'str'> <class 'float'>
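
As one possible refinement of the naming caveat mentioned in this answer, the column labels could be normalized before the dataclass is generated. The sketch below is only an illustration under assumptions: `to_identifier` is a hypothetical helper, the sample column names are invented, and name collisions after normalization are not handled.

```python
import keyword
import re
from dataclasses import make_dataclass

import pandas as pd


def to_identifier(column_name) -> str:
    """Hypothetical helper: turn a column label into a usable field name."""
    # Replace every non-word character (spaces, dashes, ...) with an underscore
    name = re.sub(r"\W", "_", str(column_name))
    if not name or name[0].isdigit():
        name = f"_{name}"  # identifiers cannot start with a digit
    if keyword.iskeyword(name):
        name = f"{name}_"  # avoid reserved words such as "class"
    return name


df = pd.DataFrame({"unit price": [1.5, 2.0], "2nd id": [10, 20], "class": ["a", "b"]})

# Fields given as bare names are typed as typing.Any by make_dataclass
field_names = [to_identifier(c) for c in df.columns]
Row = make_dataclass("Row", field_names)

rows = [Row(*values) for values in df.itertuples(index=False, name=None)]
print(field_names)  # ['unit_price', '_2nd_id', 'class_']
```

Two different labels can map to the same field name (for example "a b" and "a-b"), in which case `make_dataclass` raises an error, so a real implementation would need a deduplication step.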
