Converting a numpy array into a dict of values mapped to rows

Question

Consider a 2D numpy array where each row represents a unique item and each column within the row holds a label assigned to that item. For example, a 10 x 25 array would represent 10 items, each of which has up to 25 labels.

What would be the most efficient way to convert this to a dict (or another appropriate datatype, bonus points if it can be sorted by length) that maps labels to the row indices in which each label occurs? For example, dict[1] would return a list of the row indices that contain 1 as a label.

For example,

Given:
    [1, 2, 3]
    [1, 0, 0]
    [1, 3, 0]

Result:
    1: 0, 1, 2 # 1 occurs in rows 0, 1, 2
    3: 0, 2    # 3 occurs in rows 0, 2
    0: 1, 2    # 0 occurs in rows 1, 2 (0 is padding for lack of labels)
    2: 0       # 2 occurs in row 0 only

I believe the result should have 2: 0, because 2 only occurs in the 0th row?
– jpp, Commented May 22, 2018 at 19:51

Paul Panzer · Accepted Answer · 2018-05-22 20:08:14Z

4

UPDATE: added ordering by length.

We can use advanced indexing to create a grid indexed by items and labels. We can then iterate over the columns and use flatnonzero to get the item ids:

>>> import numpy as np
>>> ex = [[1, 2, 3],
...       [1, 0, 0],
...       [1, 3, 0]]
>>> 
>>> m = len(ex)
>>> n = np.max(ex) + 1
>>> grid = np.zeros((m, n), int) # could also use a smaller dtype here
>>> grid[np.arange(m)[:, None], ex] = 1
>>> grid
array([[0, 1, 1, 1],
       [1, 1, 0, 0],
       [1, 1, 0, 1]])
>>> idx = np.argsort(np.count_nonzero(grid, 0))[::-1]
>>> dict(zip(idx, map(np.flatnonzero, grid.T[idx])))
{1: array([0, 1, 2]), 3: array([0, 2]), 0: array([1, 2]), 2: array([0])}

Note that dictionaries remember the insertion order of their keys. That is an implementation detail in CPython 3.6 and a guaranteed language feature from Python 3.7 onward.
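The steps above can be wrapped into a reusable function. The following is a minimal sketch, not a definitive implementation: the name labels_to_rows_grid is hypothetical, the input is assumed to be a 2D array of non-negative integer labels, and a bool grid stands in for the "smaller dtype" mentioned in the comment. It uses a stable sort, so labels with equal counts come out in ascending label order, which differs from the tie order in the output above.

```python
import numpy as np

def labels_to_rows_grid(arr):
    """Map each label to the rows containing it, most frequent label first.

    Hypothetical helper; assumes `arr` is 2D with non-negative integer labels.
    """
    arr = np.asarray(arr)
    m = arr.shape[0]
    n = int(arr.max()) + 1
    # grid[i, j] is True when label j occurs somewhere in row i
    grid = np.zeros((m, n), dtype=bool)
    grid[np.arange(m)[:, None], arr] = True
    counts = grid.sum(axis=0)
    # Stable sort on negated counts: descending by count, ties by label value
    order = np.argsort(-counts, kind="stable")
    # Labels between 0 and the maximum that never occur are skipped
    return {int(label): np.flatnonzero(grid[:, label]).tolist()
            for label in order if counts[label] > 0}

result = labels_to_rows_grid([[1, 2, 3],
                              [1, 0, 0],
                              [1, 3, 0]])
print(result)
```

The grid needs m * (max label + 1) cells, so this sketch is only reasonable when the largest label is not much bigger than the number of distinct labels.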

edited May 22, 2018 at 20:08

answered May 22, 2018 at 19:51

Paul Panzer

jpp · Accepted Answer · 2018-05-22 19:50:31Z

3

You can use collections.defaultdict, before using OrderedDict to sort by number of observations:

import numpy as np
from collections import defaultdict, OrderedDict

A = np.array([[1, 2, 3],
              [1, 0, 0],
              [1, 3, 0]])

d = defaultdict(list)

for idx, row in enumerate(A):
    for i in set(row):
        d[i].append(idx)

res = OrderedDict(sorted(d.items(), key=lambda x: len(x[1]), reverse=True))

print(res)

OrderedDict([(1, [0, 1, 2]),
             (3, [0, 2]),
             (0, [1, 2]),
             (2, [0])])
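On Python 3.7 and later, where dict insertion order is guaranteed, the same idea can be sketched with a plain dict instead of OrderedDict. This is an illustrative variant under that assumption; the function name labels_to_rows is hypothetical, and the input is taken as nested Python lists so that the keys are ordinary ints.

```python
from collections import defaultdict

def labels_to_rows(rows):
    """Hypothetical helper: map each label to the row indices containing it."""
    d = defaultdict(list)
    for idx, row in enumerate(rows):
        for label in set(row):  # each label is counted once per row
            d[label].append(idx)
    # A plain dict keeps the sorted order on Python 3.7+
    return dict(sorted(d.items(), key=lambda kv: len(kv[1]), reverse=True))

res = labels_to_rows([[1, 2, 3],
                      [1, 0, 0],
                      [1, 3, 0]])
print(res)
```

For a numpy array A, passing A.tolist() gives Python ints as keys rather than NumPy integer scalars.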

answered May 22, 2018 at 19:50

jpp

Riolku · Accepted Answer · 2018-05-22 19:52:01Z

You can just define an empty dict ({}) and iterate through the array, adding values as you go, like so:

def f(array):
    table = {}  # Initialize the dict
    for rownumber, row in enumerate(array):  # Go through all of the rows, with associated numbering
        for element in set(row):  # Deduplicate to avoid duplicate row numbers
            if element not in table:
                table[element] = []  # Initialize empty row list if this element is new
            table[element].append(rownumber)  # Add the current (0-based) row number to the associated list of rows
    return table

print(f([[1, 2, 3], [1, 0, 0], [1, 3, 0]]))

This approach is O(N²) for an N x N array (more generally, O(rows × columns)). set() is linear in the row length and is called once per row, and dict membership tests and list appends are constant time on average.
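To illustrate why the loop deduplicates each row with set(), here is a small sketch that runs the same logic with and without it. The function name label_rows and the dedupe flag are hypothetical, added only for this comparison, and row numbers are 0-based as in the question.

```python
def label_rows(array, dedupe=True):
    """Hypothetical helper: map each element to the rows in which it occurs."""
    table = {}
    for rownumber, row in enumerate(array):
        elements = set(row) if dedupe else row
        for element in elements:
            # setdefault creates the empty list the first time an element is seen
            table.setdefault(element, []).append(rownumber)
    return table

data = [[1, 2, 3], [1, 0, 0], [1, 3, 0]]
with_set = label_rows(data)
without_set = label_rows(data, dedupe=False)
print(with_set[0])     # [1, 2]
print(without_set[0])  # [1, 1, 2]
```

Without set(), row 1 is recorded twice for label 0 because 0 appears twice in that row.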
