Python - Removing overlapping lists

Question

Say I have a list of lists that has indexes [[start, end], [start1, end1], [start2, end2]].

Like for example :

[[0, 133], [78, 100], [25, 30]].

How would I get check for overlap between among the lists and remove the list with the longer length each time? So:

>>> list = [[0, 133], [78, 100], [25, 30]]
>>> foo(list)
[[78, 100], [25, 30]]

This is what I tried to do so far:

def cleanup_list(list):
    i = 0
    c = 0
    x = list[:]
    end = len(x)
    while i < end-1:
        for n in range(x[i][0], x[i][1]):
            if n in range(x[i+1][0], x[i+1][1]):
                list.remove(max(x[i], x[i+1]))
        i +=1
    return list

But in addition to being kind of convoluted it's not working properly:

 >>>cleanup_list([[0,100],[9,10],[12,90]])
 [[0, 100], [12, 90]]

Any help would be appreciated!

EDIT:

Other examples would be:

>>>a = [[0, 100], [4, 20], [30, 35], [30, 78]]
>>>foo(a)
[[4, 20], [30, 35]]

>>>b = [[30, 70], [25, 40]]
>>>foo(b)
[[25, 40]]

I'm basically trying to remove all of the longest lists that overlap with another list. In this case I don't have to worry about the lists being of equal length.

Thanks!!

([[0,100],[9,10],[12,90]]) should go to [[0,100]] correct? — HennyH
– HennyH, Commented May 1, 2013 at 4:49
I fear this will be an ill-defined problem in case of three or more overlaps — wim
– wim, Commented May 1, 2013 at 4:50
I'm trying to actually remove [0, 100] and get [[9, 10], [12, 90]] — user2338068
– user2338068, Commented May 1, 2013 at 4:50
@user2338068: Can you show us some sample input / output combinations — Ayush
– Ayush, Commented May 1, 2013 at 4:55

jfs · Accepted Answer · 2013-05-01 23:34:20Z

To remove a minimal number of intervals from the list such that the intervals that are left do not overlap, O(n*log n) algorithm exists:

def maximize_nonoverlapping_count(intervals):
    # sort by the end-point
    L = sorted(intervals, key=lambda (start, end): (end, (end - start)),
               reverse=True) # O(n*logn)
    iv = build_interval_tree(intervals) # O(n*log n)
    result = []
    while L: # until there are intervals left to consider
        # pop the interval with the smallest end-point, keep it in the result
        result.append(L.pop()) # O(1)
        # remove intervals that overlap with the popped interval
        overlapping_intervals = iv.pop(result[-1]) # O(log n + m)
        remove(overlapping_intervals, from_=L) 
    return result

It should produce the following results:

f = maximize_nonoverlapping_count
assert f([[0, 133], [78, 100], [25, 30]]) == [[25, 30], [78, 100]]
assert f([[0,100],[9,10],[12,90]]) == [[9,10], [12, 90]]
assert f([[0, 100], [4, 20], [30, 35], [30, 78]]) == [[4, 20], [30, 35]]
assert f([[30, 70], [25, 40]]) == [[25, 40]]

It requires the data structure that can find in O(log n + m) time all intervals that overlap with the given interval e.g., IntervalTree. There are implementations that can be used from Python e.g., quicksect.py, see Fast interval intersection methodologies for the example usage.

Here's a quicksect-based O(n**2) implementation of the above algorithm:

from quicksect import IntervalNode

class Interval(object):
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.removed = False

def maximize_nonoverlapping_count(intervals):
    intervals = [Interval(start, end) for start, end in intervals]
    # sort by the end-point
    intervals.sort(key=lambda x: (x.end, (x.end - x.start)))   # O(n*log n)
    tree = build_interval_tree(intervals) # O(n*log n)
    result = []
    for smallest in intervals: # O(n) (without the loop body)
        # pop the interval with the smallest end-point, keep it in the result
        if smallest.removed:
            continue # skip removed nodes
        smallest.removed = True
        result.append([smallest.start, smallest.end]) # O(1)

        # remove (mark) intervals that overlap with the popped interval
        tree.intersect(smallest.start, smallest.end, # O(log n + m)
                       lambda x: setattr(x.other, 'removed', True))
    return result

def build_interval_tree(intervals):
    root = IntervalNode(intervals[0].start, intervals[0].end,
                        other=intervals[0])
    return reduce(lambda tree, x: tree.insert(x.start, x.end, other=x),
                  intervals[1:], root)

Note: the time complexity in the worst case is O(n**2) for this implementation because the intervals are only marked as removed e.g., imagine such input intervals that len(result) == len(intervals) / 3 and there were len(intervals) / 2 intervals that span the whole range then tree.intersect() would be called n/3 times and each call would execute x.other.removed = True at least n/2 times i.e., n*n/6 operations in total:

n = 6
intervals = [[0, 100], [0, 100], [0, 100], [0, 10], [10, 20], [15, 40]])
result = [[0, 10], [10, 20]]

Here's a banyan-based O(n log n) implementation:

from banyan import SortedSet, OverlappingIntervalsUpdator # pip install banyan

def maximize_nonoverlapping_count(intervals):
    # sort by the end-point O(n log n)
    sorted_intervals = SortedSet(intervals,
                                 key=lambda (start, end): (end, (end - start)))
    # build "interval" tree O(n log n)
    tree = SortedSet(intervals, updator=OverlappingIntervalsUpdator)
    result = []
    while sorted_intervals: # until there are intervals left to consider
        # pop the interval with the smallest end-point, keep it in the result
        result.append(sorted_intervals.pop()) # O(log n)

        # remove intervals that overlap with the popped interval
        overlapping_intervals = tree.overlap(result[-1]) # O(m log n)
        tree -= overlapping_intervals # O(m log n)
        sorted_intervals -= overlapping_intervals # O(m log n)
    return result

Note: this implementation considers [0, 10] and [10, 20] intervals to be overlapping:

f = maximize_nonoverlapping_count
assert f([[0, 100], [0, 10], [11, 20], [15, 40]]) == [[0, 10] ,[11, 20]]
assert f([[0, 100], [0, 10], [10, 20], [15, 40]]) == [[0, 10] ,[15, 40]]

sorted_intervals and tree can be merged:

from banyan import SortedSet, OverlappingIntervalsUpdator # pip install banyan

def maximize_nonoverlapping_count(intervals):
    # build "interval" tree sorted by the end-point O(n log n)
    tree = SortedSet(intervals, key=lambda (start, end): (end, (end - start)),
                     updator=OverlappingIntervalsUpdator)
    result = []
    while tree: # until there are intervals left to consider
        # pop the interval with the smallest end-point, keep it in the result
        result.append(tree.pop()) # O(log n)

        # remove intervals that overlap with the popped interval
        overlapping_intervals = tree.overlap(result[-1]) # O(m log n)
        tree -= overlapping_intervals # O(m log n)
    return result

Thanks so much! I'm new to coding so some of this stuff is a tad confusing - but I'll read through your links. But to clarify, is build_interval_tree a function that I'll have to create that's similar to the code in quicksect.py?
@user2338068: I've added a quicksect-based implementation, to show how it could look like (you can run it if you download quicksect.py and put it in the same directory as the script).
@user2338068: I've added banyan-based O(n log n) implementation (banyan is not pure Python (C++ is used to write C extension for Python) so it might make it harder to install).
A lot of this is over my head right now, but I'll go through it as best as I can. Thanks man, I really appreciate it.

AlessandroEmm · Accepted Answer · 2013-05-02 07:36:02Z

3

This may not be the fastest solution, but really verbose and clear - I think.

a = [[2,100], [4,10], [77,99], [38,39], [44,80], [69,70], [88, 90]]

# build ranges first
def expand(list):
    newList = []
    for r in list:
        newList.append(range(r[0], r[1] + 1))
    return newList


def compare(list):
    toBeDeleted = []
    for index1 in range(len(list)):
        for index2 in range(len(list)):
            if index1 == index2:
                # we dont want to compare ourselfs
                continue
            matches = [x for x in list[index1] if x in list[index2]]
            if len(matches) != 0: # do we have overlap?
                ## compare lengths and get rid of the longer one
                if   len(list[index1]) > len(list[index2]):
                    toBeDeleted.append(index1)
                    break
                elif len(list[index1]) < len(list[index2]):
                    toBeDeleted.append(index2)
    # distinct
    toBeDeleted = [ toBeDeleted[i] for i,x in enumerate(toBeDeleted) if x not in toBeDeleted[i+1:]] 
    print len(list)
    # remove items
    for i in toBeDeleted[::-1]:
        del list[i] 
    return list


print(compare(expand(a)))

edited May 2, 2013 at 7:36

answered May 1, 2013 at 8:42

AlessandroEmm

6987 silver badges23 bronze badges

3 Comments

jfs Over a year ago

This solution might delete too much i.e., it might return less non-overlapping intervals than possible to preserve. It is O(n**4) unnecessarily. It is not clear whether the distinct computation is correct, do you mean toBeDeleted = set(toBeDeleted) or something else? list is a builtin name, avoid using it as a variable.

AlessandroEmm Over a year ago

You are right: its not speedy (especially compared to yours) but it's more verbose and easier to maintain IMHO. toBeDeleted = set(toBeDeleted) does just what my for comprehension does, its just more obvious if you're not aware of set(). Maybe I got the problem wrong, but I haven't noticed any non-overlapping removals. Thanks for your opinion!

jfs Over a year ago

To be clear 1. This solution doesn't work: [[0, 10], [9, 12], [11, 20]] -> [[9, 10, 11, 12]] instead of [[0, 10], [11, 20]] 2. If it were working; it would be unnecessary slow e.g., even for a small input such as len(a) == 100, this solution requires ~100**4 (~100000000) operations. It is easy to make it both O(n**2) (~10000 operations) and more readable (though I find it debatable that my O(n log n) (~500 operations) version (especially banyan-based solution that uses single SortedSet) is less readable than the above O(n**4) solution).

Herrington Darkholme · Accepted Answer · 2013-05-01 06:08:11Z

2

I think one problem in your code is that it does not handle with the situation where one list contains another. For example, [0,100] contains [9,10]. When you loop n in [0,100] and n enters [9,10], the condition statement if n in range(x[i+1][0], x[i+1][1]) is triggered. Then the builtin function max will compare [0, 100] and [9, 10], and unfortuantely max will return [9,10] because it compare the first number in the list. Thus you remove the wrong element.

I'm trying another way to achieve the effect you want. Rather than manipulating the list itself, I create a new list. Conditionally appending new element to it if it meets our requirements.

def cleanup_list(lists):
    ranges = []
    for l in lists:
        to_insert = True
        for i in ranges:
            r = range(i[0],i[1])
            # if l overlaps with i, but l does not contain i
            if l[0] in r or l[1] in r:
                if (l[1]-l[0]) < len(r):
                    ranges.remove(i)
                else:
                    to_insert = False
            # l contains i
            if l[0]<i[0] and l[1]>i[1]:
                to_insert = False
        if to_insert:
            ranges.append(l)
    return ranges

answered May 1, 2013 at 6:08

Herrington Darkholme

6,3451 gold badge30 silver badges46 bronze badges

2 Comments

richselian Over a year ago

O(n^2) time complexity. slow.

Herrington Darkholme Over a year ago

Maybe this problem can be addressed by heap. Since all lists do not overlap, they can be represented by a binary heap, with start and end as value. This should optimize the time complexity to log(n)

richselian · Accepted Answer · 2013-05-01 06:07:22Z

1

ascending sort all items by length.
add them to a segment tree and ignore overlapped items.

answered May 1, 2013 at 6:07

richselian

7314 silver badges18 bronze badges

1 Comment

user2338068 Over a year ago

So would this be an optimized way to go about it? I'll look into it - thanks!

Joel Cornett · Accepted Answer · 2013-05-01 06:32:36Z

1

In general, two intervals are overlapping if:

min([upperBoundOfA, upperBoundOfB]) >= max([lowerBoundOfA, lowerBoundOfB])

If this is the case, the union of those intervals is:

(min([lowerBoundOfA, lowerBoundOfB]), max([upperBoundOfA, upperBoundOfB])

Similarly, the intersection of those intervals will be:

(min([upperBoundOfA, upperBoundOfB]), max([lowerBoundOfA, lowerBoundOfB]))

answered May 1, 2013 at 6:32

community wiki

Joel Cornett

1 Comment

user2338068 Over a year ago

I don't think finding the intersection or union helps in this particular case, but this is some very useful general advice I'll keep in mind. Thanks!

Collectives™ on Stack Overflow

Python - Removing overlapping lists

5 Answers 5

4 Comments

3 Comments

2 Comments

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

4 Comments

3 Comments

2 Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related