
I have a custom object that stores DataFrames in memory according to a certain hierarchy, and I want to write this data to a file while maintaining that hierarchy. The hierarchy involves parents, children, grandchildren, and so on. Here is a small example:

df1 = {'A': [1,2,3,4], 'B': [5,6,7,8]}
df2 = {'C': [4,3,2,1], 'D': [8,7,6,5]}
df3 = {'E': [2,3,4,5], 'F': [6,7,8,9]}

Final output (pipes are important):
|1|5|     # ←--- From df1
|4|8|     # ←--- From df2
|3|7|
|2|6|
|1|5|
|2|6|     # ←--- Back to df1
…         # ←--- df3 starts here

Here is a minimal working example of what I want to achieve; however, it does not scale to a file with over a million lines of text (it hardly scales to thousands).

import pandas as pd
import numpy as np
from time import time

# DATA STRUCTURE
class Block:
    def __init__(self, name: str, data: dict[str, pd.DataFrame]):
        self.name = name
        self.registries = data

    def __getitem__(self, registry: str) -> pd.DataFrame:
        """ Allow for dictionary stryle access. """
        return self.registries[registry]

class Sped:
    def __init__(self):
        self.blocks: dict[str, Block] = {}

    def __getitem__(self, name: str) -> Block:
        return self.blocks[name]

    def add_block(self, block: Block) -> None:
        self.blocks[block.name] = block

# GENERATE FAKE DATA
n_rows = 1000
n_cols = 5

random_data = np.random.randint(0, 100, size=(n_rows, n_cols))

column_names = ['Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']

grandchild = pd.DataFrame(random_data, columns=column_names)

random_data = np.random.randint(0, 100, size=(2, n_cols))
child = pd.DataFrame(random_data, columns=column_names)

random_data = np.random.randint(0, 100, size=(1, n_cols))
parent = pd.DataFrame(random_data, columns=column_names)

parent = parent.astype(str)
child = child.astype(str)
grandchild = grandchild.astype(str)

parent["id"] = 1
child["id"] = [1,2]
child["parent_id"] = 1

grandchild["id"] = range(n_rows)
grandchild["parent_id"] = 1

parent = parent[["id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]
child = child[["id", "parent_id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]
grandchild = grandchild[["id", "parent_id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]

parent_dict = {"Parent": parent}
child_dict = {"Child": child}
grandchild_dict = {"Grandchild": grandchild}

parent_block = Block("Parent", parent_dict)
child_block = Block("Child", child_dict)
grandchild_block = Block("Grandchild", grandchild_dict)

sped = Sped()
sped.add_block(parent_block)
sped.add_block(child_block)
sped.add_block(grandchild_block)

# FUNCTION TO BE OPTIMIZED
def _save_row(block, registry, row_index, column_id, parent_id):
    # Work on a copy to not destroy the original data in memory.
    temp_table = sped[block][registry].copy()
    # Cast id column from int to object to replace it with a string.
    temp_table = temp_table.astype("object")
    # Set the column id as the text version of the sped.
    temp_table.loc[row_index, column_id] = f"|{registry}"
    # Remove parent ID as they don't exist in speds. Some registries don't have parents so they pass None (idealy).
    if parent_id:
        temp_table.drop(parent_id, axis=1, inplace=True)
    # Create one string from the data in sped format.
    return "|".join(temp_table.values[row_index]) + "|\n"

# USAGE OF FUNCTION (WHICH CAN ALSO BE OPTIMIZED)
start = time()

sped_lines = []
sped_lines.append(_save_row("Parent", "Parent", 0, "id", None))
sped_lines.append(_save_row("Child", "Child", 0, "id", "parent_id"))
for i in range(len(sped["Grandchild"]["Grandchild"])):
    sped_lines.append(_save_row("Grandchild", "Grandchild", i, "id", "parent_id"))
sped_lines.append(_save_row("Child", "Child", 1, "id", "parent_id"))

with open("temp.txt", 'w', encoding="utf-8") as fp:
    fp.writelines(sped_lines)
print(f"Time take to write sped (in seconds): {time() - start}")

How can I optimize this solution to run faster?

3 Answers

Here are some minor coding style suggestions.

Simpler

When I run the code, I see output like:

Time take to write sped (in seconds): 0.9321348667144775

You could show fewer digits of precision to make the output simpler. Change:

print(f"Time take to write sped (in seconds): {time() - start}")

to:

print(f"Time take to write sped (in seconds): {time() - start:.2f}")

New output:

Time take to write sped (in seconds): 0.92

Documentation

The PEP 8 style guide recommends adding docstrings for classes.

class Sped:

It would be helpful to elaborate on what "Sped" means in your docstring because it is not clear from the context. If it is an acronym, it should be spelled out in the docstring. The word "sped" in English is past tense of the verb "to speed". If that is what is meant here, then it should be clarified.

There is a typo in the __getitem__ docstring: "stryle". Fixed as:

def __getitem__(self, registry: str) -> pd.DataFrame:
    """ Allow for dictionary style access. """

Another typo ("idealy") in this comment:

# Remove parent ID as they don't exist in speds. Some registries don't have parents so they pass None (idealy).

It should be "ideally".

DRY

This code is repeated several times:

['Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']

Since you already assigned it to a variable (column_names), you can use the variable in other assignments such as:

child = child[["id", "parent_id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]

Also, you could have created the column_names list in a more scalable way, such as with a list comprehension.
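As a sketch, a comprehension scales to any width (assuming the `Column_N` naming pattern holds for larger frames):

```python
n_cols = 5

# Generate the column names instead of hard-coding the list.
column_names = [f"Column_{i}" for i in range(1, n_cols + 1)]
print(column_names)  # ['Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']
```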

Naming

In the _save_row function, the variable name temp_table could simply be named table. It is a good practice to avoid naming variables as "temp".

Also in that function, it is uncommon to have a comment for each line of code. This can make the code hard to read and understand. Consider summarizing all the comments in a single docstring at the top of the function.

Time

Your time calculation is done after you write the results to the output file. It might be more meaningful to exclude the disk I/O for your purposes, or at least separate the two.
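A minimal sketch of timing the two phases separately (the line-building workload and the `StringIO` sink here are stand-ins, not the OP's actual `_save_row` loop or file):

```python
import io
from time import perf_counter

# Phase 1: build the lines in memory (stand-in workload).
start = perf_counter()
lines = [f"|{i}|{i * 2}|\n" for i in range(1000)]
build_time = perf_counter() - start

# Phase 2: write them out; a StringIO stands in for the real file here.
start = perf_counter()
buf = io.StringIO()
buf.writelines(lines)
io_time = perf_counter() - start

print(f"Build: {build_time:.4f}s, write: {io_time:.4f}s")
```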


consistent naming

There is almost nothing to the Block class, yet we wind up with this:

        self.name = name
        self.registries = data

Prefer to accept a formal parameter of registries in the signature. Or perhaps you wanted the object attribute to be named data.

Even before I noticed that detail, I was going to suggest to DRY this up slightly by making it a @dataclass. It reduces boilerplate verbiage. Also it neatly sidesteps inconsistencies like "registries" != "data", as then such a situation simply cannot arise.

Adding __getitem__ to a dataclass works just the same way; it's not like you lose any flexibility.
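A minimal sketch of what that might look like, keeping registries as the one canonical name:

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Block:
    """A named group of registries, keyed by registry name."""
    name: str
    registries: dict[str, pd.DataFrame] = field(default_factory=dict)

    def __getitem__(self, registry: str) -> pd.DataFrame:
        """Allow for dictionary-style access."""
        return self.registries[registry]


block = Block("Parent", {"Parent": pd.DataFrame({"id": [1]})})
print(block["Parent"])
```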

missing comment

class Sped:

That's not an English noun, I don't know what it is, and you did not explain it. We need a """comment""" here. If it is a term of art drawn from the business domain, then the Review Context narrative should have described something less vague than "a certain hierarchy".

column_names = ['Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']

You're going to a lot of trouble to hide the true Business Domain from reviewers and maintainers. That makes our job more difficult.

temp var reuse

random_data = np.random.randint(0, 100, size=(n_rows, n_cols))
...
random_data = np.random.randint(0, 100, size=(2, n_cols))
...
random_data = np.random.randint(0, 100, size=(1, n_cols))

You're asking me to reason about a domain that you have deliberately obfuscated, and now you're using the same name for three different concepts. Please don't do that. The machine doesn't care, but names mean something to people. They shape our thoughts and affect how we reason about the code. When there's a new meaning for the bits, it deserves a new name.

Clearly the correct names would have been

grandchild_data = ...
child_data = ...
parent_data = ...

In a similar vein, prefer to assemble the DataFrame in the form you wish all at once:

parent = pd.DataFrame(parent_data, columns=column_names).astype(str)

splat

column_names =                             ['Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']
...
parent =                      parent[["id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]
child =           child[["id", "parent_id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]
grandchild = grandchild[["id", "parent_id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]

This really needs to be DRY'd up. Python's * splat operator can help.

parent =                      parent[["id", *column_names]]
child =           child[["id", "parent_id", *column_names]]
grandchild = grandchild[["id", "parent_id", *column_names]]

Consider defining the "id" column before filling in random data, so the columns are ordered the way you want them. Alternatively you might assemble all columns beforehand and then a single call to pd.DataFrame() is enough to populate the frame with proper data in the proper order.
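A hedged sketch of the assemble-everything-beforehand approach (the seeded rng and the tiny n_rows are stand-ins for the OP's setup):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows, n_cols = 2, 5
column_names = [f"Column_{i}" for i in range(1, n_cols + 1)]

# Build every column up front so the frame comes out in the desired
# order, with no re-indexing step afterwards.
data = {
    "id": [1, 2],
    "parent_id": [1, 1],
    **{name: rng.integers(0, 100, size=n_rows) for name in column_names},
}
child = pd.DataFrame(data).astype(str)
print(list(child.columns))  # ['id', 'parent_id', 'Column_1', ..., 'Column_5']
```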

one row instead of whole table

Let's look at the _save_row() helper.

First it makes a .copy() of the whole table. Now, that doesn't have to be expensive; it can happen in O(1) constant time, thanks to copy-on-write (CoW).

But then we convert the whole thing to string type, dirtying every row, which will take O(n) linear time. You want to either convert the whole thing to string just once, or stick to converting just a single row at a time.

In the parent case we do

    temp_table.drop(parent_id, axis=1, inplace=True)

Standard advice nowadays is that inplace was a poor design choice; prefer to re-assign temp_table = temp_table.drop( ... ). That avoids some sticky edge cases which the OP code has not yet encountered, though it may as development continues.

The loop which calls this helper is pretty simple. Imagine we wrote the grandchild frame to a table. I'm not yet seeing it, but it seems like we should be able to write a single SELECT query that accomplishes what the loop does.


Long story short: use more Pandas, and fewer lists and classes. Given the current number of blocks and registries, representing the subframes as separate objects isn't a big deal, but it isn't very Pandas-idiomatic, and a single dataframe should probably be used instead. Note the suggested format, which keeps block and registry names in columns rather than in class attributes.

More importantly, don't do row-wise string formatting; issue a single call to .to_csv() with custom field and line separators.

For the purposes of comparison, I've omitted the conversion from the old memory format to the new memory format from the timings, and show the comparison only for the rendering step. In one test on my machine, the refactored code offers a speedup of

$$ \frac {0.8472 \cdot 500003 } {0.7923 \cdot 1003 } \approx 533 $$

import difflib
import io
import os
import sys
import typing

import pandas as pd
import numpy as np
import time


class BlockOP:
    """Data structure"""

    def __init__(self, name: str, data: dict[str, pd.DataFrame]) -> None:
        self.name = name
        self.registries = data

    def __getitem__(self, registry: str) -> pd.DataFrame:
        """ Allow for dictionary-style access. """
        return self.registries[registry]


class SpedOP:
    def __init__(self) -> None:
        self.blocks: dict[str, BlockOP] = {}

    def __getitem__(self, name: str) -> BlockOP:
        return self.blocks[name]

    def add_block(self, block: BlockOP) -> None:
        self.blocks[block.name] = block


def generate_fake_data(
    rand: np.random.Generator, n_rows: int = 1000, n_cols: int = 5,
) -> SpedOP:
    random_data = rand.integers(0, 100, size=(n_rows, n_cols))
    column_names = ['Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']
    grandchild = pd.DataFrame(random_data, columns=column_names)

    random_data = rand.integers(0, 100, size=(2, n_cols))
    child = pd.DataFrame(random_data, columns=column_names)

    random_data = rand.integers(0, 100, size=(1, n_cols))
    parent = pd.DataFrame(random_data, columns=column_names)

    parent = parent.astype(str)
    child = child.astype(str)
    grandchild = grandchild.astype(str)

    parent["id"] = 1
    child["id"] = [1,2]
    child["parent_id"] = 1

    grandchild["id"] = range(n_rows)
    grandchild["parent_id"] = 1

    parent = parent[["id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]
    child = child[["id", "parent_id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]
    grandchild = grandchild[["id", "parent_id", 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5']]

    parent_dict = {"Parent": parent}
    child_dict = {"Child": child}
    grandchild_dict = {"Grandchild": grandchild}

    parent_block = BlockOP("Parent", parent_dict)
    child_block = BlockOP("Child", child_dict)
    grandchild_block = BlockOP("Grandchild", grandchild_dict)

    sped = SpedOP()
    sped.add_block(parent_block)
    sped.add_block(child_block)
    sped.add_block(grandchild_block)
    return sped


def sped_to_df(sped: SpedOP) -> pd.DataFrame:
    def out_of_place(block_name: str, reg_name: str, registry: pd.DataFrame) -> pd.DataFrame:
        df = registry.copy()
        df.insert(loc=0, column='block', value=block_name)
        df.insert(loc=1, column='registry', value=reg_name)
        return df

    return pd.concat(
        [
            out_of_place(block_name, reg_name, registry)
            for block_name, block in sped.blocks.items()
            for reg_name, registry in block.registries.items()
        ], axis='rows',
    )


def _save_row_op(
    sped: SpedOP,
    block: str,
    registry: str,
    row_index: int,
    column_id: str,
    parent_id: str | None,
) -> str:
    # Work on a copy to not destroy the original data in memory.
    temp_table = sped[block][registry].copy()

    # Cast id column from int to object to replace it with a string.
    temp_table = temp_table.astype("object")

    # Set the column id as the text version of the sped.
    temp_table.loc[row_index, column_id] = f"|{registry}"

    # Remove parent ID as they don't exist in speds. Some registries don't have parents so they pass None (idealy).
    if parent_id:
        temp_table.drop(parent_id, axis=1, inplace=True)

    # Create one string from the data in sped format.
    return "|".join(temp_table.values[row_index]) + "|\n"


def write_sped_op(sped: SpedOP, fp: typing.TextIO) -> None:
    """Usage of function (which can also be optimized)"""
    sped_lines = []
    sped_lines.append(_save_row_op(sped, "Parent", "Parent", 0, "id", None))
    sped_lines.append(_save_row_op(sped, "Child", "Child", 0, "id", "parent_id"))
    for i in range(len(sped["Grandchild"]["Grandchild"])):
        sped_lines.append(_save_row_op(sped, "Grandchild", "Grandchild", i, "id", "parent_id"))
    sped_lines.append(_save_row_op(sped, "Child", "Child", 1, "id", "parent_id"))
    fp.writelines(sped_lines)


def process_sped(
    df: pd.DataFrame,
    column_id: str = 'id',
    parent_id: str = 'parent_id',
) -> pd.DataFrame:
    # Unconditionally drop everything except registry, column_1, ... _5
    keep_cols = df.columns[~df.columns.isin({
        'block', column_id, parent_id,
    })]
    children = df.loc[df['block'] == 'Child', keep_cols]
    return pd.concat(
        (
            df.loc[df['block'] == 'Parent', keep_cols],
            children.iloc[0:1],
            df.loc[df['block'] == 'Grandchild', keep_cols],
            children.iloc[1:2],  # should this really be based on position and not e.g. id?
        ), axis='rows',
    )


def write_sped(df: pd.DataFrame, file: typing.TextIO) -> None:
    file.write('|')
    df.to_csv(
        path_or_buf=file, index=False, header=False, sep='|', lineterminator='|\n|',
    )
    file.seek(file.tell()-1, os.SEEK_SET)
    file.truncate()


def regression_test() -> None:
    rand = np.random.default_rng(seed=0)
    sped_op = generate_fake_data(rand, n_rows=17)
    sped_df = sped_to_df(sped_op)

    with io.StringIO() as f:
        t0 = time.perf_counter()
        write_sped_op(sped_op, f)
        t_op = time.perf_counter() - t0
        text_op = f.getvalue()

    with io.StringIO() as f:
        t0 = time.perf_counter()
        df = process_sped(sped_df)
        write_sped(df, f)
        t_refactored = time.perf_counter() - t0
        text_refactored = f.getvalue()

    if text_op != text_refactored:
        sys.stderr.writelines(
            difflib.context_diff(
                text_op, text_refactored, fromfile='original', tofile='refactored',
            )
        )
        raise AssertionError('Text output is different')

    print(f'Small benchmark, n={len(sped_df)}: {t_op:.4f} -> {t_refactored:.4f} s')


def benchmark() -> None:
    rand = np.random.default_rng(seed=0)
    sped_op = generate_fake_data(rand, n_rows=1000)
    sped_df = sped_to_df(sped_op)

    with io.StringIO() as f:
        t0 = time.perf_counter()
        write_sped_op(sped_op, f)
        t_op = time.perf_counter() - t0

    with io.StringIO() as f:
        t0 = time.perf_counter()
        df = process_sped(sped_df)
        write_sped(df, f)
        t_refactored = time.perf_counter() - t0

    print(f'Mid benchmark, n={len(sped_df)}: {t_op:.4f} -> {t_refactored:.4f} s')


def benchmark_newonly() -> None:
    rand = np.random.default_rng(seed=0)
    sped_op = generate_fake_data(rand, n_rows=500_000)
    sped_df = sped_to_df(sped_op)

    with io.StringIO() as f:
        t0 = time.perf_counter()
        df = process_sped(sped_df)
        write_sped(df, f)
        t_refactored = time.perf_counter() - t0

    print(f'Large benchmark, refactored only, n={len(sped_df)}: {t_refactored:.4f} s')


if __name__ == '__main__':
    regression_test()
    benchmark()
    benchmark_newonly()
Small benchmark, n=20: 0.0127 -> 0.0035 s
Mid benchmark, n=1003: 0.8472 -> 0.0035 s
Large benchmark, refactored only, n=500003: 0.7923 s
  • Your code throws ValueError: bad delimiter or lineterminator value on the line df.to_csv(path_or_buf=file, index=False, header=False, sep='|', lineterminator='|\n|',). I believe this has to do with your pandas version. Can you update your code to work with a more recent pandas version such as 2.3.2? Commented Nov 10 at 12:48
  • @MarcusCarpenter I'm on 2.2.3 and it doesn't throw an error. Likewise on 2.3.3. Commented Nov 10 at 23:06
