2

A consistent answer seems to be to avoid iterating over rows while working with Pandas. I'd like to understand how I can do so in the following case.

from dataclasses import dataclass
from typing import List

import pandas as pd

@dataclass
class Person:
    id: int
    name: str
    age: int

persons_df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

persons_list: List[Person] = [] #populate this list with Person objects, created from the dataframe above

# my approach is to use itertuples()
for row in persons_df.itertuples():
    person = Person(row.id, row.name, int(row.age)) # type: ignore
    persons_list.append(person)

I'd like to find an option that avoids iterating over rows and, if possible, has some type safety built in (so that the mypy ignore comment is no longer needed).

thanks!

4 Answers

4

I am not sure if that's what you are looking for, but maybe this helps:

import pandas as pd
df = pd.DataFrame(data={'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'age': [32, 44, '86']})

class Person:
    def __init__(self, lst):
        # lst is a pandas Series (one row); .iloc gives explicit positional access
        self.id = lst.iloc[0]
        self.name = lst.iloc[1]
        self.age = lst.iloc[2]

df.apply(Person, axis=1).tolist()

out:

[<__main__.Person at 0x176eee70608>,
 <__main__.Person at 0x176eee704c8>,
 <__main__.Person at 0x176eee70388>]

2 Comments

Thanks! This option is about 10x slower than itertuples - so it won't scale as well.
@anerjee, thanks for the feedback. Sorry, I wasn't aware that your goal was to optimize speed; Series.map and DataFrame.apply are not the best choice for that. I thought you wanted to skip the for loop and avoid the ignore comment.
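
To check the speed difference mentioned in these comments on your own data, a rough timing sketch along the following lines could be used. The 1000-row frame, the helper names `with_apply` and `with_itertuples`, and the repeat count are all assumptions made for illustration; actual ratios depend on the pandas version and the data.

```python
import timeit
from dataclasses import dataclass

import pandas as pd


@dataclass
class Person:
    id: int
    name: str
    age: int


# Hypothetical test frame; replace it with your own data.
df = pd.DataFrame({'id': range(1000), 'name': ['A'] * 1000, 'age': [32] * 1000})


def with_apply():
    # Row-wise apply: each row is materialised as a pandas Series.
    return df.apply(lambda row: Person(*row), axis=1).tolist()


def with_itertuples():
    # Plain tuples, no index column, no namedtuple construction.
    return [Person(*t) for t in df.itertuples(index=False, name=None)]


t_apply = timeit.timeit(with_apply, number=5)
t_iter = timeit.timeit(with_itertuples, number=5)
print(f"apply: {t_apply:.4f}s, itertuples: {t_iter:.4f}s")
```

Both helpers build the same list, so only the elapsed time differs.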
3

I am adding a new answer because the title of the question is about mapping dataframe rows to a list of dataclass objects, and this has not been addressed yet.

To return dataclasses, we can slightly improve on @Andreas's answer without requiring an additional constructor that receives a list. We just have to use Python's unpacking operators (* and **).

I see two ways of mapping:

  1. The dataframe column names match the data class field names. In this case, we can pass each row as a set of keyword arguments: df.apply(lambda row: MyDataClass(**row), axis=1)
  2. The dataframe column names do not match the data class field names, but the column order matches the dataclass field order. In this case, we can pass the row values as ordered positional arguments: df.apply(lambda row: MyDataClass(*row), axis=1)

Example:

  1. Define same data class and same dataframe as in the question:
    from dataclasses import dataclass
    
    @dataclass
    class Person:
        id: int
        name: str
        age: int
    
    import pandas
    
    df = pandas.DataFrame(data={
        'id': [1, 2, 3],
        'name': ['A', 'B', 'C'], 
        'age': [32, 44, '86']
    })
    
  2. Conversion based on column order:
    persons = df.apply(lambda row: Person(*row), axis=1)
    
  3. Conversion based on column names (column order is shuffled for a better test):
    persons = df[['age', 'id', 'name']].apply(lambda row: Person(**row), axis=1)
    
  4. Now, we can verify our result. In both cases above:
    • This snippet:
      print(type(persons))
      print(persons)
      
    • prints:
      <class 'pandas.core.series.Series'>
      0      Person(id=1, name='A', age=32)
      1      Person(id=2, name='B', age=44)
      2    Person(id=3, name='C', age='86')
      dtype: object
      

WARNINGS:

  • I have not measured the performance of this solution.
  • This does not enforce any type checking (look at the last person printed: its age is a string). Since Python does not enforce type annotations at runtime, this quick solution does not bring any additional safety.
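
If some runtime safety is wanted, one possible sketch is to coerce each field to its annotated type in `__post_init__`. This is not part of the solution above. It assumes every annotation is a plain concrete type such as `int` or `str` (it would not handle `Optional[...]` or string annotations), and a value that cannot be converted raises an exception.

```python
from dataclasses import dataclass, fields

import pandas as pd


@dataclass
class Person:
    id: int
    name: str
    age: int

    def __post_init__(self) -> None:
        # Assumption: f.type is a callable concrete type (int, str, ...).
        # Values of another type are converted; bad values raise an error.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                setattr(self, f.name, f.type(value))


df = pd.DataFrame(data={
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'age': [32, 44, '86']
})

persons = df.apply(lambda row: Person(**row), axis=1).tolist()
print(persons[2])
```

With this in place, the age of the third person is stored as the integer 86 rather than the string '86'. A static checker such as mypy still cannot see the column types, so this only helps at runtime.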

1

One additional option would be to iterate through a NumPy array with a generator.

So for example given:

from dataclasses import dataclass

@dataclass
class Person:
    id: int
    name: str
    age: int

import pandas

df = pandas.DataFrame(data={
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'], 
    'age': [32, 44, '86']
})

you can run:

persons = list(Person(*row) for row in df.values)
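
One caveat worth checking, as far as I know: `df.values` builds a single NumPy array, so columns with different numeric dtypes are upcast to a common dtype. The small frame below (hypothetical `id` and `price` columns) is a sketch of that effect, with `itertuples(index=False, name=None)` shown as an alternative that keeps each column's own type.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
df = pd.DataFrame({'id': [1, 2], 'price': [1.5, 2.5]})

# One shared array for the whole frame: the integer column is upcast to float.
first_from_values = tuple(df.values[0])

# Column-wise iteration: each value keeps the type of its own column.
first_from_tuples = next(df.itertuples(index=False, name=None))

print(type(first_from_values[0]), type(first_from_tuples[0]))
```

In the example from this answer the frame contains strings, so `df.values` is an object array and the values pass through unchanged.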

0

Please don't slap my hands for this code ;)

I added dynamic data class generation based on information from the dataset itself.

In other words, you don't have to declare the data class yourself.

Take a look at the code below:

from dataclasses import make_dataclass
from typing import Optional, Any
import pandas

def dataframe_to_dataclasses(df: pandas.DataFrame, class_name: str) -> list[Any]:
    # make a list of fields for the future data class
    fields = []
    for column_name in df.columns:
        original_column_type = df[column_name].dtype
        not_null = all(pandas.notnull(df[column_name]))
        column_type = (
            original_column_type if not_null else Optional[original_column_type]
        )
        field = (column_name, column_type)
        fields.append(field)
    # make dataclass
    dclass = make_dataclass(cls_name=class_name, fields=fields)
    # make a list of instances dataclasses
    instances = []
    for _, row in df.iterrows():
        i = dclass(*row)
        instances.append(i)
    return instances

Be careful! As you can see, there are no checks for compliance with the naming rules for class attributes. If a column name is not a valid Python identifier (for example, it starts with a digit or consists of several words), we will get an exception. Perhaps someone will like this approach and decide to refine/improve it. Usage example:

from uuid import uuid4
from datetime import date, datetime
import pandas

data = {
    "id": [1, 2, 3],
    "date": [date(2025, 4, 12), date(2024, 3, 2), date(2023, 4, 18)],
    "moment": [
        datetime(2025, 4, 12, 23, 12),
        datetime(2024, 3, 2, 17, 41),
        datetime(2023, 4, 18, 11, 32),
    ],
    "label": [uuid4(), uuid4(), uuid4()],
    "description": ["Первый", None, "Третий"],
    "price": [231.73, 532.89, 50.7],
}
table = pandas.DataFrame(data)
instances = dataframe_to_dataclasses(table, "Test")
for i in instances:
    print(
        "class: ",
        i,
        "attribute_types:",
        type(i.id),
        type(i.date),
        type(i.moment),
        type(i.label),
        type(i.description),
        type(i.price),
    )

The code above will output to the console:

class:  Test(id=1, date=datetime.date(2025, 4, 12), moment=Timestamp('2025-04-12 23:12:00'), label=UUID('5ad3582b-91f3-48b7-b904-9223ea867402'), description='Первый', price=231.73) attribute_types: <class 'int'> <class 'datetime.date'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'uuid.UUID'> <class 'str'> <class 'float'>
class:  Test(id=2, date=datetime.date(2024, 3, 2), moment=Timestamp('2024-03-02 17:41:00'), label=UUID('1be89b91-c940-42a7-8248-973ef99fd98d'), description=None, price=532.89) attribute_types: <class 'int'> <class 'datetime.date'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'uuid.UUID'> <class 'NoneType'> <class 'float'>
class:  Test(id=3, date=datetime.date(2023, 4, 18), moment=Timestamp('2023-04-18 11:32:00'), label=UUID('90451cf8-5d95-4bd4-8166-7c0d92b26990'), description='Третий', price=50.7) attribute_types: <class 'int'> <class 'datetime.date'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'uuid.UUID'> <class 'str'> <class 'float'>
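
Regarding the column-name caveat mentioned above, a possible pre-check is sketched below. The helper name `invalid_field_names` is hypothetical. It relies on `str.isidentifier()` and `keyword.iskeyword()`, which, as far as I know, correspond to the checks `make_dataclass` applies to field names. Note that `str.isidentifier()` accepts non-ASCII letters, so a Cyrillic column name on its own passes the check.

```python
import keyword

import pandas as pd


def invalid_field_names(df: pd.DataFrame) -> list:
    """Return the column names that cannot be used as dataclass field names."""
    return [
        column
        for column in df.columns
        if not (
            isinstance(column, str)
            and column.isidentifier()
            and not keyword.iskeyword(column)
        )
    ]


# Hypothetical column names: valid, leading digit, two words, keyword, Cyrillic.
table = pd.DataFrame(columns=["id", "1st", "unit price", "class", "цена"])
print(invalid_field_names(table))
```

Calling this before building the data class makes it possible to rename or reject the problematic columns with a clearer error message.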
