
I have a database table called 'do_not_call', which contains information about files that hold ranges of 10-digit phone numbers in increasing order. The column 'filename' holds the name of the file that contains the range of numbers from 'first_phone' to 'last_phone'. There are about 2500 records in the 'do_not_call' table.

And I have a list of SQLAlchemy records. I need to find which file holds the 'phone' field of each of these records. So I have created a function that takes in the SQLAlchemy records and returns a dictionary where each key is a filename and the value is a list of phone numbers from the records that fall within the range of first and last phone numbers contained in that file.

from datetime import datetime

def get_file_mappings(dbcursor, import_records):
    start_time = datetime.now()
    phone_list = [int(rec.phone) for rec in import_records]
    dnc_sql = "SELECT * from do_not_call;"
    dbcursor.execute(dnc_sql)
    dnc_result = dbcursor.fetchall()
    file_mappings = {}
    for file_info in dnc_result:
        first_phone = int(file_info.get('first_phone'))
        last_phone = int(file_info.get('last_phone'))
        phone_ranges = list(filter(lambda phone: phone in range(first_phone, last_phone), phone_list))
        if phone_ranges:
            file_mappings.update({file_info.get('filename'): phone_ranges})
            phone_list = list(set(phone_list) - set(phone_ranges))
    # print(file_mappings)
    print("Time = ", datetime.now() - start_time)
    return file_mappings

For example, if the phone_list is [2023143300, 2024393100, 2027981539, 2022760321, 2026416368, 2027585911], the file_mappings returned will be

    {'1500000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2023143300, 2022760321],
     '1700000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2024393100],
     '1900000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2027981539, 2026416368, 2027585911]}

The problem here is that it takes a lot of time to execute: on average about 1.5 seconds per 1000 records. Is there a better approach/algorithm to solve this problem? Any help is appreciated.

  • list(filter(lambda phone: phone in range(first_phone, last_phone), phone_list)) is more idiomatic (and likely faster) as [phone for phone in phone_list if phone in range(first_phone, last_phone)]. Commented Sep 28, 2020 at 15:26
  • @Carcigenicate yes, that shaved off 0.5 seconds for 1000 records Commented Sep 28, 2020 at 15:30
  • Ya, I wasn't expecting a huge gain. It will be a little faster though because producing a lazy collection (via filter), then forcing it into a list has always been slower for me. Also, lambda use seems to be slow. list(set(phone_list) - set(phone_ranges)) is also probably going to be much faster as phone_range_set = set(phone_ranges); phone_list = [phone for phone in phone_list if phone not in phone_range_set] Creating two sets just to do subtraction seems needlessly expensive. Commented Sep 28, 2020 at 15:35
  • Double check the output of those because I'm just eyeballing them, but they should be equivalent. Commented Sep 28, 2020 at 15:37
  • What is the typical size of phone_list? And it seems your result is somewhat backward from what I would expect... I would think you would want key=phone, value=file ? Commented Sep 28, 2020 at 16:25
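The comprehension-plus-single-set variant suggested in the comments above can be sketched like this (the function name and the shape of `file_ranges` are illustrative, not from the original code; the half-open bound matches the question's `range(first_phone, last_phone)`):

```python
def bin_phones(file_ranges, phone_list):
    """file_ranges: iterable of (filename, first_phone, last_phone) tuples."""
    file_mappings = {}
    for filename, first_phone, last_phone in file_ranges:
        # list comprehension instead of list(filter(lambda ...))
        matched = [p for p in phone_list if first_phone <= p < last_phone]
        if matched:
            file_mappings[filename] = matched
            matched_set = set(matched)  # build one set, reuse it for membership tests
            phone_list = [p for p in phone_list if p not in matched_set]
    return file_mappings

ranges = [('a.txt', 100, 200), ('b.txt', 200, 300)]
print(bin_phones(ranges, [150, 250, 199]))  # {'a.txt': [150, 199], 'b.txt': [250]}
```

This avoids both the per-element lambda call and the construction of two throwaway sets per file, but it is still a nested loop over files and phones.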

1 Answer


This is a very inefficient approach to binning things into sorted ranges. You are not taking advantage of the fact that your bins are sorted (or could easily be sorted if they were not). By testing every phone number against every bin with the lambda, you are effectively running a big nested loop.

You could make some marginal improvements by being consistent with set use (see below). But in the end, you could/should just find each phone's place in the bin listing with an efficient search, like bisection. See the example below with timings of the original, the set implementation, and bisection insertion.

If your phone_list is just massive, then other approaches may be advantageous, such as finding where the bin cutoffs fit into a sorted copy of the phone list... but the code below is about 500x faster than what you have now for 1,000 or 10,000 records.

# phone sorter
import random
import bisect
import time
from collections import defaultdict

# make some fake data of representative size
low_phone = 200_000_0000
data = []   # [file, low_phone, high_phone]
for idx in range(2500):
    row = []
    row.append(f'file_{idx}')
    row.append(low_phone + idx * 20000000)
    row.append(low_phone + (idx + 1) * 20000000 - 20)  # some gap
    data.append(row)

high_phone = data[-1][-1]

# generate some random phone numbers in range
num_phones = 10000
phone_list_orig = [random.randint(low_phone, high_phone) for t in range(num_phones)]

# orig method...
phone_list = phone_list_orig[:]
tic = time.time()
results = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = list(filter(lambda phone: phone in range(low, high + 1), phone_list))  # + 1: high bound is inclusive
    if phone_ranges:
        results.update({row[0]:phone_ranges})
        phone_list = list(set(phone_list) - set(phone_ranges))
toc = time.time()
print(f'orig time: {toc-tic:.3f}')

# with sets across the board...
phone_list = set(phone_list_orig)
tic = time.time()
results2 = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = set(filter(lambda phone: phone in range(low, high + 1), phone_list))  # + 1: high bound is inclusive
    if phone_ranges:
        results2.update({row[0]:phone_ranges})
        phone_list = phone_list - phone_ranges
toc = time.time()
print(f'using sets time: {toc-tic:.3f}')

# using bisection search
phone_list = set(phone_list_orig)
tic = time.time()
results3 = defaultdict(list)
lows = [t[1] for t in data]
for phone in phone_list:
    location = bisect.bisect(lows, phone) - 1
    if location >= 0 and phone <= data[location][2]:  # it is within the high limit of bin
        results3[data[location][0]].append(phone)
toc = time.time()
print(f'using bisection sort time: {toc-tic:.3f}')

# for k in sorted(results3):
#   print(k, ':', results3.get(k))

# normalize values (list vs. set, arbitrary ordering) before comparing
normalize = lambda d: {k: set(v) for k, v in d.items()}
assert normalize(results) == normalize(results2) == normalize(results3)

results:

orig time: 5.236
using sets time: 4.597
using bisection sort time: 0.012
[Finished in 9.9s]

2 Comments

Thanks for the answer. The binary search method was fast. What if the phone_list is massive? I'm developing taking into consideration a few thousand records, but in production there might be millions of records.
Great. If the answer was helpful, you could accept it with the checkmark. If the phone list gets massive... If there is a way to keep the phone list sorted (or sort it offline or such), then you might consider reversing the search direction. Meaning, bisect the bin boundaries from the files into the sorted list of phones and use those indices to slice. Or you could process in batches. A couple of things to tinker with.
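The "reversed" idea from the comment above can be sketched as follows: with the phones sorted once, bisect each bin's boundaries into the phone list and take a slice, so the per-bin work is two O(log n) searches instead of a scan. Names (`file_ranges`, `bin_sorted_phones`) are illustrative, and the upper bound is treated as inclusive, as in the bisection method above:

```python
import bisect

def bin_sorted_phones(file_ranges, phones):
    """file_ranges: iterable of (filename, low, high) with high inclusive."""
    phones = sorted(phones)  # sort once; ideally kept sorted offline
    out = {}
    for filename, low, high in file_ranges:
        i = bisect.bisect_left(phones, low)    # first phone >= low
        j = bisect.bisect_right(phones, high)  # first phone > high
        if i < j:
            out[filename] = phones[i:j]        # contiguous slice of matches
    return out

ranges = [('a.txt', 100, 199), ('b.txt', 200, 299)]
print(bin_sorted_phones(ranges, [250, 150, 199]))  # {'a.txt': [150, 199], 'b.txt': [250]}
```

With ~2500 bins this costs roughly one sort of the phone list plus 2500 binary searches, regardless of how many millions of phones there are.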
