
I am trying to find a time- and memory-efficient way to create a list of dictionaries by combining a list of tuples with a 1D numpy array. Each dictionary should have the following structure:

{"id": <first elem from tuple>, "name": <second elem from tuple>, "text": <third elem from tuple>, "value": <elem from array>}

The list of tuples looks as follows:

list_of_tuples = [("id1", "name1", "text1"), ("id2", "name2", "text2")]

The numpy array has the same number of elements as the list and contains values of type np.float16:

value_array = np.array([0.42, np.nan], dtype=np.float16)

All NaN values are to be filtered out, so the result of the above example should be:

{"id": "id1", "name": "name1", "text": "text1", "value": 0.42}

I did get it to work like this:

import math

[
    dict(
        dict(zip(["id", "name", "text"], list_of_tuples[index])),
        **{"value": value},
    )
    for index, value in enumerate(value_array)
    if not math.isnan(value)
]

However, this is horribly slow for many entries, and using the index to get entries from the list feels wrong/inefficient.

2 Answers


You can definitely do this without using the index explicitly, which should improve performance.

value_array_indices = np.argwhere(~np.isnan(value_array)).flatten()
list_of_tuples = np.array(list_of_tuples)[value_array_indices]
value_array = value_array[value_array_indices]
[{"id": x[0], "name": x[1], "text": x[2], "value": v} for x, v in zip(list_of_tuples, value_array)]

6 Comments

Grand, thank you! value_array_indices in your code gave me an array of 1-element arrays; accessing them with [0] then only gave me one element. That was easily fixed using: np.concatenate(np.argwhere(~np.isnan(value_array)), axis=0)
You forgot to filter out the NaNs, but it is easily fixed by putting an if not math.isnan(v) clause in your list comprehension.
@alaniwi: value_array_indices only captures the indices of non-nan values.
@pecey Sorry yes, I see what you've done now - I didn't read it properly.
Since the last line involves iteration, I expect it would be faster if the zip arguments were lists. I also suspect that the if not math.isnan(v) clause would be faster than the argwhere approach, though that may depend on the proportion of NaNs. Since the dicts have to be created individually, numpy doesn't buy much, if anything.

Looks like someone posted a similar solution while I was writing this, but I'll post it anyway because of the timings measured, and some words of explanation.

Using the code suggested below, and with test inputs of length a million (which include a single NaN), I am seeing it go down to under 30% of the time compared to the code in the question.

Time: 0.3486933708190918
{'id': 'id0', 'name': 'name0', 'text': 'text0', 'value': 0.0} {'id': 'id999999', 'name': 'name999999', 'text': 'text999999', 'value': 999999.0} 999999
Time: 1.2175893783569336
{'id': 'id0', 'name': 'name0', 'text': 'text0', 'value': 0.0} {'id': 'id999999', 'name': 'name999999', 'text': 'text999999', 'value': 999999.0} 999999

I think the difference here is partly not having to index the list of tuples, but also I suspect that a large part of it is not having to instantiate a zip object for every element. You are dealing with a small number of dictionary keys with the same names each time, so you really do not need the flexibility that zip affords here, and it is more straightforward just to create the dictionary from a simple explicit expression.

(The zip(list_of_tuples, value_array) is obviously only creating one zip object for the whole operation, so is not significant.)

I've also suggested from math import isnan here rather than doing an attribute lookup to get math.isnan every time, although the difference turns out to be relatively unimportant.

from math import isnan
import numpy as np
import time

# construct some test data
n = 1000000
value_array = np.arange(n, dtype=np.float64)
value_array[n // 2] = np.nan
list_of_tuples = [(f"id{i}", f"name{i}", f"text{i}")
                  for i in range(len(value_array))]

# timings for suggested alternative method
t0 = time.time()
l = [{"id": t[0],
      "name": t[1],
      "text": t[2],
      "value": v}
     for t, v in zip(list_of_tuples, value_array) if not isnan(v)]
t1 = time.time()
print("Time:", t1 - t0)
print(l[0], l[-1], len(l))

# timings for the method in the question
t0 = time.time()
l = [
    dict(
        dict(zip(["id", "name", "text"], list_of_tuples[index])),
        **{"value": value},
    )
    for index, value in enumerate(value_array)
    if not isnan(value)
]
t1 = time.time()
print("Time:", t1 - t0)
print(l[0], l[-1], len(l))

Also tried and rejected: create a boolean array of not isnan values, using

not_isnan_array = np.logical_not(np.isnan(value_array))

and then in the list comprehension you can do:

... for t, v, not_isnan in zip(list_of_tuples, value_array, not_isnan_array) if not_isnan

but it has almost no effect on the timing so does not justify the extra memory usage.
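
For completeness, that rejected variant in full (reconstructed from the fragments above, using the same dict expression as the timed code):

# Precompute a boolean mask once and carry it through the zip,
# instead of calling isnan per element inside the comprehension.
not_isnan_array = np.logical_not(np.isnan(value_array))
l = [{"id": t[0], "name": t[1], "text": t[2], "value": v}
     for t, v, not_isnan in zip(list_of_tuples, value_array, not_isnan_array)
     if not_isnan]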

UPDATE

Further experimentation with hybrid versions (between the original in the question and the proposed alternative) shows that, as I suspected, most of the difference comes from the avoidance of creating a zip object on every iteration. The avoidance of explicitly indexing the list of tuples accounts for only a small part of the speedup.
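
For illustration only (the exact hybrids tested are not shown here), one such hybrid keeps the question's per-element zip but iterates over the pairs directly instead of indexing:

# Hypothetical hybrid: direct iteration over (tuple, value) pairs as in
# the alternative, but still building each dict via a per-element zip
# as in the question. (*t, v) appends the value to the 3-tuple.
l = [dict(zip(["id", "name", "text", "value"], (*t, v)))
     for t, v in zip(list_of_tuples, value_array)
     if not isnan(v)]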

