First, one million elements is not that much on modern hardware, so filtering with a plain list comprehension may already be fast enough. Let's set up some test data first:
import random, bisect, operator
random.seed(0)
years = list(range(1950, 2003))
N = 1_000_000
employees = sorted([(i, random.choice(years)) for i in range(N)],
                   key=operator.itemgetter(1))
target_year = 1980
%timeit emp_1980 = [i for i, y in employees if y == target_year]
# 1 loop, best of 3: 261 ms per loop
# can query 4 years per second
You could use the built-in bisect module with lists instead of scalars, but by default it compares the first elements (the IDs) rather than the years we want. We can work around that with some preprocessing:
import bisect
# all (year, id) pairs, ordered by year (employees is already sorted by year)
year_id_sorted = [[y, i] for i, y in employees]
def ids_from_year(target_y):
    result = []
    # make use of elementwise list ordering;
    # IDs are non-negative, so [target_y, 0] sorts at or before
    # every [target_y, id] entry
    i = bisect.bisect_left(year_id_sorted, [target_y, 0])
    for y, ID in year_id_sorted[i:]:
        if y == target_y:
            result.append(ID)
        else:
            break
    return result
%timeit ids_from_year(target_year)
# 100 loops, best of 3: 11.9 ms per loop
# can query 100 years per second
This yields a speedup by a factor of roughly 22. Not too bad, but we've incurred some preprocessing cost and had to store a copy of the dataset. Now let's try storing the employees in a dictionary of the form "year => list of employee IDs".
from collections import defaultdict
employee_by_year = defaultdict(list)
for i, y in employees:
    employee_by_year[y].append(i)
# 1 loop, best of 3: 361 ms per loop
# the preprocessing step is only about 40% more expensive than
# a single lookup in the original list-comprehension case
%timeit employee_by_year[target_year]
# 10000000 loops, best of 3: 63.2 ns per loop
# can query 16 million years per second
# other factors will likely limit processing speed
I suppose that the optimal implementation depends on how often you plan to run the query. Running it even twice already justifies building the dict. If you were querying on a continuous metric (e.g. salary, where you would typically want a range rather than an exact match), binary search would be the better fit. If you need to query on multiple metrics, e.g. year and salary, you run into a memory vs. speed trade-off, and the best strategy really depends on your data distribution.
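To illustrate the range-query case: with a continuous metric, exact-match dict keys are useless, but bisect on a sorted list handles ranges directly. A minimal sketch (the salary figures and IDs below are invented for the example):

```python
import math
import bisect

# (salary, id) pairs kept sorted by salary -- hypothetical data
salary_id_sorted = [(30_000, 4), (42_000, 1), (55_000, 3),
                    (55_000, 0), (70_000, 2)]

def ids_in_salary_range(lo_salary, hi_salary):
    # all employees with lo_salary <= salary <= hi_salary;
    # (lo_salary,) sorts before any (lo_salary, id) pair, and
    # (hi_salary, inf) sorts after any (hi_salary, id) pair
    lo = bisect.bisect_left(salary_id_sorted, (lo_salary,))
    hi = bisect.bisect_right(salary_id_sorted, (hi_salary, math.inf))
    return [i for _, i in salary_id_sorted[lo:hi]]

print(ids_in_salary_range(40_000, 60_000))  # [1, 3, 0]
```

Each range query is O(log n) for the two bisections plus O(k) for the k matches, which is the same complexity profile as the year lookup above.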