
I wonder if there is a way to speed up the multiple nested for loops in my matrix function.

Here is my function:

import time

import numpy as np
import pandas as pd


def matrix(Xbin, y):
    labels = np.unique(y)
    con_matrix = []
    start = time.time()
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            # Crossover: XOR every row of group i with every row of group j
            for u in Xbin[y == labels[i]]:
                for v in Xbin[y == labels[j]]:
                    con_matrix.append(np.bitwise_xor(u, v))
    end = time.time()
    duration = end - start
    print("total time for nested loop: ", duration)
    constraint_matrix = np.array(con_matrix)
    bin_attr_dim = [i for i in range(1, Xbin.shape[1] + 1)]

    df = pd.DataFrame(constraint_matrix, columns=bin_attr_dim)

    return df

Please note that Xbin is a numpy.ndarray and y denotes the group each row belongs to (1, 2, 3). The picture below shows that kind of ndarray (columns a to h), Figure 1:

[Figure 1: binarized dataset Xbin, columns a to h]

My matrix function described above generates its output as a DataFrame corresponding to the picture below (columns a to h): one row for each combination of elements from different groups. Figure 2:

[Figure 2: XOR rows for every cross-group pair, columns a to h]
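The screenshots do not reproduce here, so as a small stand-in (toy values, not the original figures): Figure 1 shows a binarized dataset Xbin with one 0/1 column per binary attribute (a to h) plus a group label per row, and Figure 2 shows one XOR row per cross-group pair of rows. A minimal illustration:

import numpy as np

# Hypothetical stand-in for Figure 1: columns a..h, one group label per row
# (the values here are illustrative, not the original data).
Xbin = np.array([[0, 1, 0, 1, 0, 1, 1, 0],   # group 1
                 [1, 1, 0, 1, 1, 0, 0, 1],   # group 1
                 [0, 1, 1, 0, 1, 0, 0, 1],   # group 2
                 [1, 0, 1, 0, 1, 0, 1, 1]])  # group 2
y = np.array([1, 1, 2, 2])

# matrix(Xbin, y) then returns one XOR row per cross-group pair of rows,
# i.e. 2 * 2 = 4 rows over the same columns a..h, which is what Figure 2 tabulates.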

Here is my code to generate a dataset in binarized format, as shown in Figure 1:

from itertools import combinations


def binarize_dataset(X, y):
    cutpoints = {}

    att = -1
    for row in X.T:
        att += 1
        labels = None  # Previous labels
        u = -9999  # Previous xi

        # Finding transitions
        for v in sorted(np.unique(row)):
            variation = v - u  # Current - Previous

            # Classes where current v appears
            indexes = np.where(row == v)[0]
            # current label
            __labels = set(y[indexes])

            # Main condition
            if labels is not None and variation > 0:
                # Testing for transition to find the essential cut-points
                if (len(labels) > 1 or len(__labels) > 1) or labels != __labels:
                    # cut-point id
                    cid = len(cutpoints)
                    cutpoints[cid] = (att, u + variation / 2.0)

            labels = __labels
            # previous equals current
            u = v
    new_dict = {}
    # Iterate over the values in the original dictionary
    for key, value in cutpoints.items():
        first_element = value[0]
        second_element = value[1]

        # Check if the first_element is already a key in new_dict
        if first_element in new_dict:
            new_dict[first_element].append(second_element)
        else:
            new_dict[first_element] = [second_element]
    # Generate combinations of the second elements within each group
    for key, value in new_dict.items():
        comb = combinations(value, 2)
        # Append the combinations to the value list
        for c in comb:
            new_dict[key].append(c)
    arrays = []
    for attr, cutpoints in new_dict.items():
        for cutpoint in cutpoints:
            row = X.T[attr]
            if isinstance(cutpoint, tuple):
                lowerbound = cutpoint[0] <= row.reshape(X.shape[0], 1)
                upperbound = row.reshape(X.shape[0], 1) < cutpoint[1]
                row = np.logical_and(lowerbound, upperbound)
                arrays.append(row)
            else:
                row = row.reshape(X.shape[0], 1) >= cutpoint
                arrays.append(row)
    
    Xbin = np.concatenate(arrays, axis=1)

    bin_attr_dim = [i for i in range(1, Xbin.shape[1] + 1)]
    df = pd.DataFrame(Xbin, columns=bin_attr_dim)
    start = 0
    dict_parent_children = {}
    for key, list_value in new_dict.items():
        dict_parent_children[key] = list(df.columns[start: start + len(list_value)])
        start += len(list_value)

    return Xbin, df, dict_parent_children

When I tested it with the iris dataset, which is small, it runs really fast.

from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
bin_dataset, data, dict_parent_children = binarize_dataset(X, y)

con_matrix = matrix(bin_dataset, y)

When I tested it with a bigger dataset like breast cancer, it started taking much longer.

X, y = datasets.load_breast_cancer(return_X_y=True)
bin_dataset, data, dict_parent_children = binarize_dataset(X, y)
con_matrix = matrix(bin_dataset, y)

Imagine testing with a dataset even bigger than breast cancer. How can I speed up my matrix function as much as possible in this case, or is there a faster way to rewrite it?

5 Comments

  • Does this answer your question? How to speed up nested for loops in Python Commented Aug 8, 2023 at 4:19
  • Sadly no. I looked at this in advance already. Commented Aug 8, 2023 at 4:21
  • When it comes to numpy, you should avoid writing for loops at all costs if you're looking for speed. Commented Aug 8, 2023 at 4:26
  • Well, you're going to have to write a minimal reproducible example, meaning you should simplify your code down to just what's important for the question and nothing more; your code is longer than most people want to read. Commented Aug 8, 2023 at 4:56
  • When using numpy the obvious step to speeding up loops is to replace them with whole-array ("vectorized") methods; see the sketch after these comments. They still loop, but in compiled code. Sometimes that's easy to do (with experience). Sometimes it takes some tricks. Or focus on one loop at a time, or look for some top-down pattern. But as @jared wrote, if the problem/code is too big, most of us will just move on to a more interesting and focused question. Commented Aug 8, 2023 at 5:47
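As a minimal illustration of the whole-array approach the commenters describe (my own sketch, not code from the thread; xor_two_groups is an illustrative name), the two innermost loops of matrix can be collapsed into a single broadcasted call per pair of groups:

import numpy as np

def xor_two_groups(A, B):
    # A: (n_a, d) rows of one group, B: (n_b, d) rows of another group.
    # Broadcasting builds all n_a * n_b XOR rows in one compiled call,
    # in the same order the nested loops would append them.
    return np.bitwise_xor(A[:, None, :], B[None, :, :]).reshape(-1, A.shape[1])

# Inside matrix() the two outer loops over label pairs would remain, e.g.:
#     con_matrix.append(xor_two_groups(Xbin[y == labels[i]], Xbin[y == labels[j]]))
#     constraint_matrix = np.concatenate(con_matrix, axis=0)

Note that each call materializes a full n_a by n_b block at once, which already hints at the memory concern raised in the answer below.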

1 Answer


I tried it on the breast cancer dataset using the data binarization code you provided and ran into issues myself: the intermediate arrays became very large, so I had to change my code somewhat and switch to a less vectorized solution. A potential further improvement would be to spread the work over multiple processes, but I am not sure how efficiently the data would be shared between them.

However, are you aware that the result will take over 30 gigabytes of memory? (A back-of-the-envelope estimate of that figure follows the output below.)

The whole script took around 31 seconds on my MacBook Pro with the M2 Pro CPU. Anyway, here is the code:

import numpy as np
import pandas as pd
from numpy.typing import NDArray
from itertools import product, pairwise, chain, combinations
from numpy.lib.stride_tricks import sliding_window_view
from sys import getsizeof


def binarize_dataset(X, y):
    """This is the binarization code you provided"""
    cutpoints = {}

    att = -1
    for row in X.T:
        att += 1
        labels = None  # Previous labels
        u = -9999  # Previous xi

        # Finding transitions
        for v in sorted(np.unique(row)):
            variation = v - u  # Current - Previous

            # Classes where current v appears
            indexes = np.where(row == v)[0]
            # current label
            __labels = set(y[indexes])

            # Main condition
            if labels is not None and variation > 0:
                # Testing for transition to find the essential cut-points
                if (len(labels) > 1 or len(__labels) > 1) or labels != __labels:
                    # cut-point id
                    cid = len(cutpoints)
                    cutpoints[cid] = (att, u + variation / 2.0)

            labels = __labels
            # previous equals current
            u = v
    new_dict = {}
    # Iterate over the values in the original dictionary
    for key, value in cutpoints.items():
        first_element = value[0]
        second_element = value[1]

        # Check if the first_element is already a key in new_dict
        if first_element in new_dict:
            new_dict[first_element].append(second_element)
        else:
            new_dict[first_element] = [second_element]
    # Generate combinations of the second elements within each group
    for key, value in new_dict.items():
        comb = combinations(value, 2)
        # Append the combinations to the value list
        for c in comb:
            new_dict[key].append(c)
    arrays = []
    for attr, cutpoints in new_dict.items():
        for cutpoint in cutpoints:
            row = X.T[attr]
            if isinstance(cutpoint, tuple):
                lowerbound = cutpoint[0] <= row.reshape(X.shape[0], 1)
                upperbound = row.reshape(X.shape[0], 1) < cutpoint[1]
                row = np.logical_and(lowerbound, upperbound)
                arrays.append(row)
            else:
                row = row.reshape(X.shape[0], 1) >= cutpoint
                arrays.append(row)

    Xbin = np.concatenate(arrays, axis=1)

    bin_attr_dim = [i for i in range(1, Xbin.shape[1] + 1)]
    df = pd.DataFrame(Xbin, columns=bin_attr_dim)
    start = 0
    dict_parent_children = {}
    for key, list_value in new_dict.items():
        dict_parent_children[key] = list(df.columns[start : start + len(list_value)])
        start += len(list_value)

    return Xbin, df, dict_parent_children


def xor_grouped_pairs(X: NDArray, y: NDArray, y_is_sorted: bool = False):
    """
    If X is already sorted with respect to y, you can save some time by specifying that it is already sorted
    """
    if not y_is_sorted:
        X = X[y.argsort()]

    # Cumulative group boundaries, e.g. [0, n1, n1 + n2, ...]
    pairs = np.concatenate(([0], np.cumsum(np.unique(y, return_counts=True)[1])))
    # (start, stop) index range of each group ...
    pairs = sliding_window_view(pairs, 2)
    # ... consecutive pairs of group ranges ...
    pairs = pairwise(pairs)
    # ... expanded lazily into all (left-group row, right-group row) index pairs
    pairs = chain.from_iterable(product(range(*l), range(*r)) for l, r in pairs)

    xor_matrix = np.array([np.bitwise_xor(X[a], X[b]) for a, b in pairs], dtype=np.uint8)
    return pd.DataFrame(data=xor_matrix)


if __name__ == "__main__":
    X = np.array(
        [
            [0, 1, 0, 1, 0, 1, 1, 0],
            [1, 1, 0, 1, 1, 0, 0, 1],
            [0, 1, 1, 0, 1, 0, 0, 1],
            [1, 0, 1, 0, 1, 0, 1, 1],
            [0, 0, 0, 1, 1, 1, 0, 0],
            [1, 1, 0, 1, 0, 1, 0, 1],
            [0, 0, 1, 0, 1, 0, 1, 0],
        ]
    )

    y = np.array([1, 1, 1, 2, 2, 3, 3])
    print(f"Result on the small example:")
    print(xor_grouped_pairs(X, y))

    from sklearn import datasets
    X, y = datasets.load_breast_cancer(return_X_y=True)
    bin_dataset, data, dict_parent_children = binarize_dataset(X, y)
    print(f"The size of the binarized dataset is {bin_dataset.shape}, memory={getsizeof(bin_dataset) / 1e9} GB")
    print("Result on binarized breast cancer dataset:")
    print(result := xor_grouped_pairs(bin_dataset, y))
    print(f"Memory usage to store the result of binarized breast cancer: {getsizeof(result) / 1e9} GB")

Output:

Result on the small example:
   0  1  2  3  4  5  6  7
0  1  1  1  1  1  1  0  1
1  0  1  0  0  1  0  1  0
2  0  1  1  1  0  0  1  0
3  1  1  0  0  0  1  0  1
4  1  1  0  0  0  0  1  0
5  0  1  1  1  0  1  0  1
6  0  1  1  1  1  1  1  0
7  1  0  0  0  0  0  0  1
8  1  1  0  0  1  0  0  1
9  0  0  1  1  0  1  1  0
The size of the binarized dataset is (569, 487179), memory=0.277204979 GB
Result on binarized breast cancer dataset:
       0       1       2       3       4       5       6       ...  487172  487173  487174  487175  487176  487177  487178
0           0       0       0       0       0       0       0  ...       0       0       0       0       0       0       0
1           0       0       0       0       0       0       0  ...       0       0       0       0       0       0       0
2           1       1       1       1       1       1       1  ...       0       0       0       0       0       0       0
3           1       1       1       1       1       1       1  ...       0       0       0       0       0       0       0
4           0       0       0       0       0       0       0  ...       0       0       0       0       0       0       0
...       ...     ...     ...     ...     ...     ...     ...  ...     ...     ...     ...     ...     ...     ...     ...
75679       1       1       1       1       1       1       1  ...       0       0       0       0       0       0       0
75680       0       0       0       0       0       0       0  ...       0       0       0       0       0       0       0
75681       0       0       0       0       0       0       0  ...       0       0       0       0       0       0       0
75682       0       0       0       0       0       0       1  ...       0       0       0       0       0       0       0
75683       1       1       1       1       1       1       1  ...       0       0       0       0       0       0       0

[75684 rows x 487179 columns]
Memory usage to store the result of binarized breast cancer: 36.87165558 GB
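
For a rough sense of where that figure comes from (my own back-of-the-envelope arithmetic, not part of the original answer): the breast cancer dataset has 212 samples of one class and 357 of the other, and the binarized matrix has 487,179 columns, so the result holds 212 × 357 = 75,684 uint8 rows:

# Back-of-the-envelope size of the result (1 byte per uint8 cell)
rows = 212 * 357          # cross-class pairs in breast cancer -> 75,684
cols = 487_179            # binarized columns reported above
print(rows * cols / 1e9)  # ~36.87 GB, matching the getsizeof figure above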


2 Comments

  • Can you test it with an actual dataset like breast cancer?
  • I have tested it now, see my updated answer. I had to do some tweaks to my code.
