
The goal is to reduce memory usage while keeping the output unchanged, i.e. the resulting DataFrame must still produce the same hash as the reference test hash.

What I've tried so far:

  1. Adding __slots__, but it didn't make any difference.
  2. Changing the default dtype from float64 to float32. Although it reduces memory usage significantly, it breaks the test by changing the hash (see the short illustration after this list).
  3. Converting the data into an np.array reduced CPU time from 13 s to 2.05 s, but didn't affect memory usage.
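
For reference, a minimal, self-contained illustration of why the dtype change in attempt 2 breaks the test (toy values, not my actual data): pd.util.hash_pandas_object works on the exact stored values and dtype, so downcasting to float32 yields a different digest.

import hashlib

import numpy as np
import pandas as pd

s64 = pd.Series(np.array([0.1, 0.2, 0.3], dtype=np.float64))
s32 = s64.astype(np.float32)

h64 = hashlib.sha256(pd.util.hash_pandas_object(s64, index=True).values).hexdigest()
h32 = hashlib.sha256(pd.util.hash_pandas_object(s32, index=True).values).hexdigest()

# Downcasting rounds the values and changes the bytes being hashed, so the digests differ.
print(h64 == h32)  # False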

The code to reproduce:

import hashlib
import random
import tracemalloc
import typing as tp

import numpy as np
import pandas as pd

rows = 40000000
trs = 10

random.seed(42)

generated_data: tp.List[float] = np.array([random.random() for _ in range(rows)])



def df_upd(df_initial: pd.DataFrame, df_new: pd.DataFrame) -> pd.DataFrame:
    return pd.concat((df_initial, df_new), axis=1)


class Transform:
    """adding a column of random data"""
    __slots__ = ['var']
    def __init__(self, var: float):
        self.var = var

    def transform(self, df_initial: pd.DataFrame) -> pd.DataFrame:
        return df_upd(df_initial, pd.DataFrame({self.var: generated_data}))


class Pipeline:
    __slots__ = ['df', 'transforms']
    def __init__(self):
        self.df = pd.DataFrame()
        self.transforms = np.array([Transform(f"v{i}") for i in range(trs)])

    def run(self):
        for t in self.transforms:
            self.df = t.transform(self.df)
        return self.df


if __name__ == "__main__":
    # starting the monitoring
    tracemalloc.start()

    # function call
    pipe = Pipeline()
    %time df = pipe.run()
    print("running")

    # displaying the memory
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage is {current / 10**3} KB ({(current / 10**3)*0.001} MB); Peak was {peak / 10**3} KB ({(peak / 10**3)*0.001} MB); Diff = {(peak - current) / 10**3} KB ({((peak - current) / 10**3)*0.001} MB)")

    # stopping the library
    tracemalloc.stop()
    
    # should stay unchanged
    %time hashed_df = hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
    print("hashed_df", hashed_df)    
    
    assert hashed_df == test_hash  # test_hash: expected reference digest, defined elsewhere

    print("Success!")
  • It seems to me this question is better suited to the Code Review site, which is a question-and-answer site for peer programmer code reviews. Please read its guidance on how to ask questions before posting there. Commented Nov 12, 2022 at 0:41
  • At a first glance it looks like your code would not pass type checking. E.g. generated_data: tp.List[float] should probably be closer to generated_data: np.ndarray, etc. Commented Nov 15, 2022 at 17:17

1 Answer


If you avoid pd.concat() and instead use the preferred way of adding a column to a DataFrame:

df["new_col_name"] = new_col_data

peak memory consumption drops significantly.
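
To see the difference in isolation, here is a scaled-down sketch comparing the two ways of building the frame (sizes are illustrative and the exact numbers will vary by machine):

import tracemalloc

import numpy as np
import pandas as pd

rows, cols = 1_000_000, 10
data = np.random.default_rng(42).random(rows)

def build_concat() -> pd.DataFrame:
    # Grow the frame by concatenating one column at a time (copies everything at each step).
    df = pd.DataFrame()
    for i in range(cols):
        df = pd.concat((df, pd.DataFrame({f"v{i}": data})), axis=1)
    return df

def build_assign() -> pd.DataFrame:
    # Add each column with plain column assignment.
    df = pd.DataFrame()
    for i in range(cols):
        df[f"v{i}"] = data
    return df

for build in (build_concat, build_assign):
    tracemalloc.start()
    build()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{build.__name__}: peak ~ {peak / 1e6:.0f} MB")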


In your code it is sufficient to fix the Transform class:

class Transform:
    """adding a column of random data"""
    __slots__ = ['var']
    def __init__(self, var: str):
        self.var = var

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.var] = generated_data
        return df

(Note that I also changed the type of var from float to str to reflect how it is used in the code).


On my machine I went from:

Current memory usage is 1600110.987 KB (1600.110987 MB); Peak was 4480116.325 KB (4480.116325 MB); Diff = 2880005.338 KB (2880.005338 MB)

to:

Current memory usage is 1760101.105 KB (1760.101105 MB); Peak was 1760103.477 KB (1760.1034769999999 MB); Diff = 2.372 KB (0.002372 MB)

(I am not sure why the current memory usage is slightly higher in this case).


For faster computation, you may want to do some pre-allocation.

To do that, you could replace, in Pipeline's __init__():

self.df = pd.DataFrame()

with:

self.df = pd.DataFrame(data=np.empty((rows, trs)), columns=[f"v{i}" for i in range(trs)])
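
Putting the pieces together, a sketch of the pre-allocated pipeline (assuming, as above, that each transform simply writes generated_data into its column):

class Pipeline:
    __slots__ = ['df', 'transforms']

    def __init__(self):
        # Allocate all trs float64 columns up front, so run() only needs to fill
        # existing columns rather than build the frame column by column.
        self.df = pd.DataFrame(
            data=np.empty((rows, trs)),
            columns=[f"v{i}" for i in range(trs)],
        )
        self.transforms = np.array([Transform(f"v{i}") for i in range(trs)])

    def run(self) -> pd.DataFrame:
        for t in self.transforms:
            self.df = t.transform(self.df)
        return self.df

Since every column is overwritten with the same float64 data, the resulting values, and therefore the hash, should be unchanged.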

If you want to get even faster, you can compute the DataFrame right away in the Pipeline's __init__, e.g.:

class Pipeline:
    __slots__ = ['df', 'transforms']
    def __init__(self):
        self.df = pd.DataFrame(data=generated_data[:, None] + np.zeros(trs)[None, :], columns=[f"v{i}" for i in range(trs)])

    def run(self):
        return self.df

but I assume your Transform is a proxy of a more complex operation and I am not sure this simplification is easy to adapt beyond the toy code in the question.


2 Comments

Thanks, it really helped to reduce memory, but somehow it affected CPU time: it increased from 2 s to 11 s. Is there any way to avoid that?
@Vagner I updated the answer to address performance. Essentially, you can either do some pre-allocation or, even faster, create the DataFrame right away. Hope it helps!
