First, one million elements is not that much on modern hardware, so filtering with a plain list comprehension may already be fast enough. Let's set up some test data first:
import random, bisect, operator
random.seed(0)
years = list(range(1950, 2003))
N = 1_000_000
employees = sorted([(i, random.choice(years)) for i in range(N)],
                   key=operator.itemgetter(1))
target_year = 1980
%timeit emp_1980 = [i for i, y in employees if y == target_year]
# 1 loop, best of 3: 261 ms per loop
# can query 4 years per second
You could use the built-in bisect module with lists instead of scalars, but by default it compares the first elements (the IDs) rather than the years we want. We can work around that with some preprocessing:
import bisect
# all (year, id) pairs, ordered by year (employees is already sorted by year)
year_id_sorted = [[y, i] for i, y in employees]
def ids_from_year(target_y):
    result = []
    # make use of elementwise list ordering;
    # IDs are non-negative, so [target_y, 0] sorts at or before
    # every [target_y, id] entry
    i = bisect.bisect_left(year_id_sorted, [target_y, 0])
    for y, ID in year_id_sorted[i:]:
        if y == target_y:
            result.append(ID)
        else:
            break
    return result
%timeit ids_from_year(target_year)
# 100 loops, best of 3: 11.9 ms per loop
# can query 100 years per second
This yields a speedup by a factor of roughly 22. Not too bad, but we've incurred some preprocessing cost and had to store a copy of the dataset. Now let's try storing the employees in a dictionary of the form "year => list of employee IDs".
from collections import defaultdict
employee_by_year = defaultdict(list)
for i, y in employees:
    employee_by_year[y].append(i)
# 1 loop, best of 3: 361 ms per loop
# the preprocessing step is only about 40% more expensive than
# a single lookup in the original list-comprehension case
%timeit employee_by_year[target_year]
# 10000000 loops, best of 3: 63.2 ns per loop
# can query 16 million years per second
# other factors will likely limit processing speed
I suppose that the optimal implementation depends on how often you plan to run the query. Running it even twice already justifies building the dict. If you were querying on a continuous metric (e.g. salary, where you would typically want a range rather than an exact match), binary search would be the better fit. If you need to query on multiple metrics, e.g. year and salary, you run into a memory vs. speed trade-off, and the best strategy really depends on your data distribution.
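To illustrate the range-query case: with a continuous metric, exact-match dict keys are useless, but bisect on a sorted list handles ranges directly. A minimal sketch (the salary figures and IDs below are invented for the example):

```python
import math
import bisect

# (salary, id) pairs kept sorted by salary -- hypothetical data
salary_id_sorted = [(30_000, 4), (42_000, 1), (55_000, 3),
                    (55_000, 0), (70_000, 2)]

def ids_in_salary_range(lo_salary, hi_salary):
    # all employees with lo_salary <= salary <= hi_salary;
    # (lo_salary,) sorts before any (lo_salary, id) pair, and
    # (hi_salary, inf) sorts after any (hi_salary, id) pair
    lo = bisect.bisect_left(salary_id_sorted, (lo_salary,))
    hi = bisect.bisect_right(salary_id_sorted, (hi_salary, math.inf))
    return [i for _, i in salary_id_sorted[lo:hi]]

print(ids_in_salary_range(40_000, 60_000))  # [1, 3, 0]
```

Each range query is O(log n) for the two bisections plus O(k) for the k matches, which is the same complexity profile as the year lookup above.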