0

Suppose I have a 2D numpy array where each row represents a unique item and each column within the row holds a label assigned to that item. For example, a 10 x 25 array would represent 10 items, each with up to 25 labels.

What would be the most efficient way to convert this to a dict (or another appropriate datatype, bonus points if it can be sorted by length) that maps each label to the row indices in which that label occurs? For example, dict[1] would return a list of the row indices that contain 1 as a label.

For example,

Given:
    [1, 2, 3]
    [1, 0, 0]
    [1, 3, 0]

Result:
    1: 0, 1, 2 # 1 occurs in rows 0, 1, 2
    3: 0, 2    # 3 occurs in rows 0, 2
    0: 1, 2    # 0 occurs in rows 1, 2 (0 is padding for lack of labels)
    2: 0       # 2 occurs in row 0 only
3
  • Please share some example data and expected output. Commented May 22, 2018 at 19:32
  • @Tgsmith61591 Added a toy example Commented May 22, 2018 at 19:39
  • I believe result should have 2: 0 because 2 only occurs in the 0th row? Commented May 22, 2018 at 19:51

3 Answers

4

UPDATE: added ordering by length.

We can use advanced indexing to create a grid indexed by items and labels, where grid[i, k] is 1 if item i carries label k. We can then iterate over the columns and use flatnonzero to get the item ids for each label:

>>> import numpy as np
>>> 
>>> ex = [[1, 2, 3],
...       [1, 0, 0],
...       [1, 3, 0]]
>>> 
>>> m = len(ex)
>>> n = np.max(ex) + 1
>>> grid = np.zeros((m, n), int) # could also use a smaller dtype here
>>> grid[np.arange(m)[:, None], ex] = 1
>>> grid
array([[0, 1, 1, 1],
       [1, 1, 0, 0],
       [1, 1, 0, 1]])
>>> idx = np.argsort(np.count_nonzero(grid, 0))[::-1]
>>> dict(zip(idx, map(np.flatnonzero, grid.T[idx])))
{1: array([0, 1, 2]), 3: array([0, 2]), 0: array([1, 2]), 2: array([0])}

Note that dictionaries remember the insertion order of their keys. That was an implementation detail in CPython 3.6 and is a guaranteed language feature from Python 3.7 onward.
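
The grid approach can also be wrapped in a small helper. The following is a minimal sketch, assuming the labels are non-negative integers (they are used directly as column indices); the function name `label_to_rows` is hypothetical. It uses a stable sort, so labels with equal counts are ordered by label value, which is a slightly different tie order from the session shown in this answer.

```python
import numpy as np

def label_to_rows(labels):
    """Map each label to the row indices that contain it, most frequent first.

    Assumption: `labels` is a 2D array-like of non-negative integers.
    """
    labels = np.asarray(labels)
    m = labels.shape[0]
    n = labels.max() + 1
    # grid[i, k] is True if row i contains label k
    grid = np.zeros((m, n), dtype=bool)
    grid[np.arange(m)[:, None], labels] = True
    counts = grid.sum(axis=0)
    # Stable sort on negated counts: descending by count, ties by label value
    order = np.argsort(-counts, kind="stable")
    # Skip label values that never occur (gaps below the maximum label)
    return {int(k): np.flatnonzero(grid[:, k]) for k in order if counts[k] > 0}

result = label_to_rows([[1, 2, 3],
                        [1, 0, 0],
                        [1, 3, 0]])
```

Memory grows with the number of rows times the largest label value, so this sketch suits dense, small label ranges better than sparse, large ones.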

3

You can use collections.defaultdict to collect the row indices for each label, then an OrderedDict to order the labels by number of occurrences:

import numpy as np
from collections import defaultdict, OrderedDict

A = np.array([[1, 2, 3],
              [1, 0, 0],
              [1, 3, 0]])

d = defaultdict(list)

for idx, row in enumerate(A):
    for i in set(row):
        d[i].append(idx)

res = OrderedDict(sorted(d.items(), key=lambda x: len(x[1]), reverse=True))

print(res)

OrderedDict([(1, [0, 1, 2]),
             (3, [0, 2]),
             (0, [1, 2]),
             (2, [0])])
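
Since the question treats 0 as padding, a variant of this loop might skip the padding value. This is only a sketch: the function name `rows_by_label` and the `pad` parameter are hypothetical, and it relies on plain dicts preserving insertion order (Python 3.7+) instead of OrderedDict.

```python
from collections import defaultdict

import numpy as np

def rows_by_label(A, pad=None):
    """Map each label to the rows containing it, skipping the `pad` label."""
    d = defaultdict(list)
    for idx, row in enumerate(np.asarray(A)):
        # tolist() yields plain Python ints; sorted() makes key order deterministic
        for label in sorted(set(row.tolist())):
            if label != pad:
                d[label].append(idx)
    # sorted() is stable, so labels with equal counts keep first-seen order
    return dict(sorted(d.items(), key=lambda kv: len(kv[1]), reverse=True))

A = np.array([[1, 2, 3],
              [1, 0, 0],
              [1, 3, 0]])

res = rows_by_label(A, pad=0)
print(res)  # {1: [0, 1, 2], 3: [0, 2], 2: [0]}
```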

1

You can just define an empty dict and iterate through the array, adding row indices as you go, like so:

def f(array):
    table = {} # Initialize the dict
    for rownumber, row in enumerate(array): # Goes through all of the rows, with associated 0-based numbering
        for element in set(row): # Deduplicate to avoid duplicate row numbers
            if element not in table: table[element] = [] # Initialize empty row list if this element is new
            table[element].append(rownumber) # Add the current (0-based) row index to this element's list
    return table

print(f([[1, 2, 3], [1, 0, 0], [1, 3, 0]]))

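This approach is O(N·M) for an array with N rows and M columns (O(N²) when the array is square): set() is linear in the row length and is called once per row, and dict membership tests and list appends take constant time on average.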
