
I apologize for including too much information.

I apologize for not including enough information.

What do I want the code to do?

arrayCurveLocations is an unsorted ndarray with pairs of integer numbers: curveLocations and distinctCrossings.

If a curveLocations number appears more than once in the ndarray, I want to consolidate the curveLocations rows by adding together the corresponding distinctCrossings numbers.

The newly created ndarray does not need to be sorted.

"Sparse" values and numpy.bincount

If I correctly understand the term, the values of curveLocations are "sparse" values: they are rarely contiguous, and the difference between the largest and smallest values is large. If I am reading numpy.bincount correctly, "The length of out is equal to np.amax(x)+1." The algorithm calls this function once per iteration as it decrements n. The largest value is capped as a function of n: curveLocationsMAXIMUM: int = 1 << (2 * n + 4). If n is 25, for example, the number of bins will be approximately 2^54, and most of the bins would be empty.
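To make the scale concrete, here is a toy sketch (the key values are invented, not real curveLocations data) of the allocation numpy.bincount performs:

```python
import numpy as np

# Toy data, not the real curveLocations: one large key value is enough to
# force numpy.bincount to allocate max(x) + 1 bins, nearly all of them empty.
curveLocations = np.array([3, 1_000_000, 3, 7], dtype=np.int64)
distinctCrossings = np.array([2, 5, 4, 1], dtype=np.int64)

# `weights=` makes bincount sum distinctCrossings per key instead of counting.
binned = np.bincount(curveLocations, weights=distinctCrossings)
print(binned.size)  # 1000001 bins for only 3 distinct keys
```

At n = 25 the same arithmetic gives roughly 2^54 bins, far beyond available memory, even though the number of distinct keys stays small.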

idk, I'm lost

I think numpy.unique is so close to being what I need. The following example from the docs reconstructs the original ndarray with the array of unique values and the array of "inverse indices". The indices subscript the array of unique elements, though.

numpy.unique(ar, return_index=False, return_inverse=False, return_counts=False, axis=None, 
    *, equal_nan=True, sorted=True)

# Reconstruct the input array from the unique values and inverse:

>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> u, indices = np.unique(a, return_inverse=True)
>>> u
array([1, 2, 3, 4, 6])
>>> indices
array([0, 1, 4, 3, 1, 2, 1])
>>> u[indices]
array([1, 2, 6, 4, 2, 3, 2])

If I had an array of indices that subscripted the original array, I think the following would work.

def aggregateCurveLocations(arrayCurveLocations: DataArray2columns) -> DataArray3columns:
    # `return_fantasy_indices` is imaginary: I want indices that subscript the *original* array.
    u, indices4arrayCurveLocations = np.unique(arrayCurveLocations, return_fantasy_indices=True)
    ...
    consolidatedDistinctCrossings = np.add.reduceat(arrayCurveLocations[:, columnDistinctCrossings], indices4arrayCurveLocations)
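For the record, here is a minimal sketch (invented data) of how np.add.reduceat can actually be made to work: it needs the start offset of each group in a *sorted* array, and np.unique(..., return_index=True) yields exactly those offsets once the rows are sorted by curveLocations. Column 0 is distinctCrossings and column 1 is curveLocations, matching the column constants in the appendix.

```python
import numpy as np

# Invented 2-column data: [distinctCrossings, curveLocations] per row.
arrayCurveLocations = np.array([[2, 5], [1, 3], [7, 5], [4, 3]], dtype=np.uint64)

# Sort rows by curveLocations so equal keys become contiguous runs.
order = np.argsort(arrayCurveLocations[:, 1], kind="stable")
sortedRows = arrayCurveLocations[order]

# On sorted keys, return_index gives the first position of each run:
# exactly the offsets np.add.reduceat needs.
uniqueLocations, groupStarts = np.unique(sortedRows[:, 1], return_index=True)
consolidatedDistinctCrossings = np.add.reduceat(sortedRows[:, 0], groupStarts)
print(np.column_stack((uniqueLocations, consolidatedDistinctCrossings)))
```

Here the keys 3 and 5 each appear twice, and their distinctCrossings sums are 1+4=5 and 2+7=9.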

I think I'm missing a fundamental concept

I'm trying to deduplicate through consolidation, right? But the sort order is irrelevant. "If there are duplicates, then add them together" seems like a straightforward task, so I think I am overlooking something or misunderstanding something.

Appendices

Partial code to define the symbols

Full code, but as of this writing, it's a little annoying to run it:

from mapFolding.algorithms.oeisIDbyFormula import A000682
for n in range(3,40):
    print(A000682(n))
type DataArray2columns = numpy.ndarray[tuple[int, ...], numpy.dtype[numpy.uint64]]
type DataArray3columns = numpy.ndarray[tuple[int, ...], numpy.dtype[numpy.uint64]]

columnsArrayCurveGroups = columnsArrayTotal = 3
columnΩ: int = (columnsArrayTotal - columnsArrayTotal) - 1  # Something _feels_ right about this instead of `= -1`.
columnDistinctCrossings = columnΩ = columnΩ + 1
columnGroupAlpha = columnΩ = columnΩ + 1
columnGroupZulu = columnΩ = columnΩ + 1
if columnΩ != columnsArrayTotal - 1:
    message = f"Please inspect the code above this `if` check. '{columnsArrayTotal = }', therefore '{columnΩ = }' must be '{columnsArrayTotal - 1 = }' due to 'zero-indexing.'"
    raise ValueError(message)
del columnsArrayTotal, columnΩ

columnsArrayCurveLocations = columnsArrayTotal = 2
columnΩ: int = (columnsArrayTotal - columnsArrayTotal) - 1
columnDistinctCrossings = columnΩ = columnΩ + 1
columnCurveLocations = columnΩ = columnΩ + 1
if columnΩ != columnsArrayTotal - 1:
    message = f"Please inspect the code above this `if` check. '{columnsArrayTotal = }', therefore '{columnΩ = }' must be '{columnsArrayTotal - 1 = }' due to 'zero-indexing.'"
    raise ValueError(message)
del columnsArrayTotal, columnΩ

def aggregateCurveLocations(arrayCurveLocations: DataArray2columns) -> DataArray3columns:
    arrayCurveGroups: DataArray3columns = numpy.tile(
        A=numpy.unique(arrayCurveLocations[:, columnCurveLocations])
        , reps=(columnsArrayCurveGroups, 1)
        ).T
    arrayCurveGroups[:, columnDistinctCrossings] = 0
    numpy.add.at(
        arrayCurveGroups[:, columnDistinctCrossings]
        , numpy.searchsorted(
            a=arrayCurveGroups[:, columnCurveLocations]
            , v=arrayCurveLocations[:, columnCurveLocations])
        , arrayCurveLocations[:, columnDistinctCrossings]
    )
    # I'm computing groupZulu from curveLocations that are physically in `arrayCurveGroups`, so I'm using `columnCurveLocations`.
    numpy.bitwise_and(arrayCurveGroups[:, columnCurveLocations], numpy.uint64(groupZuluLocator64), out=arrayCurveGroups[:, columnGroupZulu])
    numpy.right_shift(arrayCurveGroups[:, columnGroupZulu], 1, out=arrayCurveGroups[:, columnGroupZulu])
    # NOTE Do not alphabetize these operations. This column has curveLocations data that groupZulu needs.
    arrayCurveGroups[:, columnGroupAlpha] &= groupAlphaLocator64
    return arrayCurveGroups
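As an aside, a hypothetical variant of this function, sketched on invented data: np.unique with return_inverse=True already yields the positions that the numpy.searchsorted call recomputes, so one of the two sort/search passes can be dropped. The groupAlpha and groupZulu bitmask columns are omitted here because groupZuluLocator64 is not defined in this excerpt; column 0 is distinctCrossings and column 1 is curveLocations, matching the constants above.

```python
import numpy as np

def aggregateCurveLocationsSketch(arrayCurveLocations):
    # return_inverse maps every original row to its slot in the unique array,
    # replacing the separate np.searchsorted pass.
    uniqueLocations, inverseIndices = np.unique(arrayCurveLocations[:, 1], return_inverse=True)
    consolidatedDistinctCrossings = np.zeros_like(uniqueLocations)
    # Unbuffered add: rows sharing a curveLocations value accumulate into one slot.
    np.add.at(consolidatedDistinctCrossings, inverseIndices, arrayCurveLocations[:, 0])
    return np.column_stack((consolidatedDistinctCrossings, uniqueLocations))
```

On input rows [[3, 1], [5, 2], [7, 1]] this yields [[10, 1], [5, 2]]: the two rows with curveLocations 1 collapse into one row with distinctCrossings 3+7=10.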

Why do I believe this function is slow?

In VS Code, I used the Flamegraph extension, which uses py-spy, to measure the algorithm's performance on 10 different values of n. The total running time of each line is displayed in the left column, and these values are representative of the many tests I've run. In this test, the full run took 204 seconds, the aggregateCurveLocations function took 107 seconds, and numpy.unique plus numpy.searchsorted accounted for 101 of those seconds. (The relevant per-line run times were in a screenshot I couldn't copy and paste.)

  • I'm trying to run the example code to check the bottlenecks as well, but it crashes because something is missing. Could you add or clarify the missing variable groupZuluLocator64? Or point to the specific aspect of np.unique that is missing, which we could perhaps help with, given that the return_inverse argument does provide a perfect map back? Commented Aug 26 at 9:03
  • Yes, the full code is in github.com/hunterhogan/mapFolding. from mapFolding.algorithms.oeisIDbyFormula import A000682; for n in range(30,40): foldsTotal = A000682(n). I haven't finished integrating this OEIS ID into the normal process. Commented Aug 26 at 10:18

2 Answers


Your code shows a use of the at method of numpy.add, but I haven't followed it completely. There appears to be more going on there than simply consolidating a 2-d array.

Here's a simple example of how you could use np.unique (with return_inverse=True) and np.add.at to consolidate your array. I'll call the input array a to save typing :)

import numpy as np


def consolidate(a):
    u, inv = np.unique(a[:, 0], return_inverse=True)
    s = np.zeros_like(u)
    np.add.at(s, inv, a[:, 1])
    return np.column_stack((u, s))


a = np.array([
    [1, 3],
    [2, 5],
    [1, 7],
    [3, 2],
    [2, 6],
    [10, 100],
    [11, 125],
    [77, 500],
    [99, 450],
    [11, 50],
    [75, 400],
    [77, 300],
    [98, 100],
    [11, 25],
    [99, 125],
    [77, 25],
    [65, 350],
    [66, 150],
    [75, 200],
])


b = consolidate(a)
print(b)

Output:

[[  1  10]
 [  2  11]
 [  3   2]
 [ 10 100]
 [ 11 200]
 [ 65 350]
 [ 66 150]
 [ 75 600]
 [ 77 825]
 [ 98 100]
 [ 99 575]]

It seems like you want to group by curveLocations and aggregate distinctCrossings by summing.

import numpy as np
arrayCurveLocations = [
    [1, 3],
    [2, 5],
    [1, 7],
    [3, 2],
    [2, 6]
]
arrayCurveLocations = np.array(arrayCurveLocations)

If you want the result below:

[
    [1, 10],
    [2, 11],
    [3, 2]
]

Then I can show how it can be done with pandas.

import pandas as pd
df = pd.DataFrame(arrayCurveLocations, columns=['curveLocations', 'distinctCrossings'])
result = df.groupby('curveLocations', as_index=False)['distinctCrossings'].sum()
result_array = result.values
print(result_array)

If you want to rely only on numpy to solve your problem, then this link could help.

1 Comment

ty. pandas.groupby is another reason I feel I must be overlooking something in numpy. I'm chained to numpy because I'll use numba or codon to accelerate the execution. The time complexity grows exponentially.
