
I am trying to find a time- and memory-efficient way to create a list of dictionaries by combining a list of tuples with a 1D numpy array. Each dictionary should have the following structure:

{"id": <first elem from tuple>, "name": <second elem from tuple>, "text": <third elem from tuple>, "value": <elem from array>}

The list of tuples looks as follows:

list_of_tuples = [("id1", "name1", "text1"), ("id2", "name2", "text2")]

The numpy array has the same number of elements as the list and contains values of type np.float16:

value_array = np.array([0.42, np.nan], dtype=np.float16)

All NaN values are to be filtered out, so the result of the above example should be:

{"id": "id1", "name": "name1", "text": "text1", "value": 0.42}

I did get it to work like this:

import math

[
    dict(
        dict(zip(["id", "name", "text"], list_of_tuples[index])),
        **{"value": value},
    )
    for index, value in enumerate(value_array)
    if not math.isnan(value)
]

However, this is horribly slow for many entries, and using the index to get entries from the list feels wrong/inefficient.

2 Answers


You can definitely do this without using the index explicitly, which should improve performance.

value_array_indices = np.argwhere(~np.isnan(value_array)).flatten()
list_of_tuples = np.array(list_of_tuples)[value_array_indices]
value_array = value_array[value_array_indices]
[{"id": x[0], "name": x[1], "text": x[2], "value": v} for x, v in zip(list_of_tuples, value_array)]

6 Comments

Grand, thank you! value_array_indices in your code gave me an array of 1-element arrays; accessing them with [0] then only gave me one element. That was easily fixed using: np.concatenate(np.argwhere(~np.isnan(value_array)), axis=0)
You forgot to filter out the NaNs, but it is easily fixed by putting an if not math.isnan(v) clause in your list comprehension.
@alaniwi: value_array_indices only captures the indices of non-nan values.
@pecey Sorry yes, I see what you've done now - I didn't read it properly.
Since the last line involves iteration, I expect it would be faster if the zip arguments were lists. I also suspect that the if not math.isnan(v) clause would be faster than the argwhere approach, though that may depend on the proportion of NaNs. Since the dicts have to be created individually, numpy doesn't buy much, if anything.

Looks like someone posted a similar solution while I was writing this, but I'll post it anyway because of the timings measured, and some words of explanation.

Using the code suggested below, and with test inputs of length a million (which include a single NaN), I am seeing it go down to under 30% of the time compared to the code in the question.

Time: 0.3486933708190918
{'id': 'id0', 'name': 'name0', 'text': 'text0', 'value': 0.0} {'id': 'id999999', 'name': 'name999999', 'text': 'text999999', 'value': 999999.0} 999999
Time: 1.2175893783569336
{'id': 'id0', 'name': 'name0', 'text': 'text0', 'value': 0.0} {'id': 'id999999', 'name': 'name999999', 'text': 'text999999', 'value': 999999.0} 999999

I think the difference here is partly not having to index the list of tuples, but also I suspect that a large part of it is not having to instantiate a zip object for every element. You are dealing with a small number of dictionary keys with the same names each time, so you really do not need the flexibility that zip affords here, and it is more straightforward just to create the dictionary from a simple explicit expression.

(The zip(list_of_tuples, value_array) is obviously only creating one zip object for the whole operation, so is not significant.)

I've also suggested from math import isnan here rather than doing an attribute lookup to get math.isnan every time, although the difference turns out to be relatively unimportant.

from math import isnan
import numpy as np
import time

# construct some test data
n = 1000000
value_array = np.arange(n, dtype=np.float64)
value_array[n // 2] = np.nan
list_of_tuples = [(f"id{i}", f"name{i}", f"text{i}")
                  for i in range(len(value_array))]

# timings for suggested alternative method
t0 = time.time()
l = [{"id": t[0],
      "name": t[1],
      "text": t[2],
      "value": v}
     for t, v in zip(list_of_tuples, value_array) if not isnan(v)]
t1 = time.time()
print("Time:", t1 - t0)
print(l[0], l[-1], len(l))

# timings for the method in the question
t0 = time.time()
l = [
    dict(
        dict(zip(["id", "name", "text"], list_of_tuples[index])),
        **{"value": value},
    )
    for index, value in enumerate(value_array)
    if not isnan(value)
]
t1 = time.time()
print("Time:", t1 - t0)
print(l[0], l[-1], len(l))

Also tried and rejected: create a boolean array of not isnan values, using

not_isnan_array = np.logical_not(np.isnan(value_array))

and then in the list comprehension you can do:

... for t, v, not_isnan in zip(list_of_tuples, value_array, not_isnan_array) if not_isnan

but it has almost no effect on the timing so does not justify the extra memory usage.
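
For completeness, that rejected variant in full (reconstructed from the fragments above, using the same dict expression as the timed code):

# Precompute a boolean mask once and carry it through the zip,
# instead of calling isnan per element inside the comprehension.
not_isnan_array = np.logical_not(np.isnan(value_array))
l = [{"id": t[0], "name": t[1], "text": t[2], "value": v}
     for t, v, not_isnan in zip(list_of_tuples, value_array, not_isnan_array)
     if not_isnan]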

UPDATE

Further experimentation with hybrid versions (between the original in the question and the proposed alternative) shows that, as I suspected, most of the difference comes from the avoidance of creating a zip object on every iteration. The avoidance of explicitly indexing the list of tuples accounts for only a small part of the speedup.
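
For illustration only (the exact hybrids tested are not shown here), one such hybrid keeps the question's per-element zip but iterates over the pairs directly instead of indexing:

# Hypothetical hybrid: direct iteration over (tuple, value) pairs as in
# the alternative, but still building each dict via a per-element zip
# as in the question. (*t, v) appends the value to the 3-tuple.
l = [dict(zip(["id", "name", "text", "value"], (*t, v)))
     for t, v in zip(list_of_tuples, value_array)
     if not isnan(v)]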

