Algorithm - many queries about median in sequence

Question

We have a sequence a₁, a₂, ..., a_n where a_i are integer, |a_i| <= 10⁶ and n <= 10⁶. We have m queries in the form: "i j" which mean "What is the median in sequence a_i, a_i+1, ..., a_j?"

Do you have some idea how to do it? I know there is a algorithm to find median in sequence in linear time (https://en.wikipedia.org/wiki/Median_of_medians), but applying it in every query is too slow.

Hard to imagine that you could get a median in faster than linear time. — john
– john, Commented Aug 21, 2015 at 21:02
When your data is so large, would the median be very far away from the mean? — user3528438
– user3528438, Commented Aug 21, 2015 at 21:03
It's easy to imagine, since there's a lot of overlaps. If you have a_i....a_j and already have their median, does it still take linear time to find the median of a_i....a_j+1 ? or a_i+1....a_j ? — zmbq
– zmbq, Commented Aug 21, 2015 at 21:04
I can't help but think of a modified Fenwick Tree that stores medians (and the count of numbers involved) instead of prefixes — AndyG
– AndyG, Commented Aug 21, 2015 at 21:11
Do you need to compute the median of every subsequence? Or just particular subsequences? Can you characterize which subsequences you need to find the median for? — Jerry Federspiel
– Jerry Federspiel, Commented Aug 21, 2015 at 21:11

Community · Accepted Answer · 2020-06-20 09:12:55Z

4

The problem is called Range Median Query. There are algorithms with varying complexities and properties, see this and this as starting points.

From Mark Gordon's answer on Quora:

Create a two dimensional orthogonal range tree created from the N points of the form (A[i], i). Constructing this tree can be done easily in O(N log^2 N) time (although O(N log N) is possible).

Now to query the kth element we traverse the first dimension of the tree. We follow the left subtree if the number of points within our query index range in the left subtree is smaller than k. This is simply a query on the second dimension tree of the left subtree. If the kth element isn't in the left subtree we adjust k appropriately and search in the right subtree. This whole search takes O(N log^2 N) time. Essentially we have dropped a log N factor from Johnny's solution by wrapping the binary search into the traversal of the tree.

It's actually possible to get this down to O(N log N) preprocessing and O(log N) per query. Skip to about 17:00 in 6.851: Advanced Data Structures (Spring'12) to see Erik Demaine explain orthogonal range trees and how to achieve the faster preprocessing and query times which take mild cleverness and fractional cascading respectively.

There are also a few research papers dedicated to the subjects if you'll search for the problem name. It's not an easy problem, and you'll probably need to do a bit of documentation to grasp the solutions. I'd start by watching the video linked in the Quora answer I quoted.

Unfortunately, I don't understand the subject matter well enough to explain it very well myself in this format. If someone does, feel free to edit this or post your own answer and I'll remove mine.

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Aug 21, 2015 at 21:12

IVlad

43.6k13 gold badges115 silver badges184 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

David Eisenstat Over a year ago

(P.S. don't implement fractional cascading unless you're in it for the educational experience.)

dspyz Over a year ago

It's actually O(log^3(N)) per query because the question here is the inverse of what's described in the video (you can imagine you're given the x-values and you want to pick y-values for the box so as to get a desired number of points), thus we need to binary search the solution where each function call is itself O(log^2N) (2 dimensions, x and y)

zmbq · Accepted Answer · 2015-08-21 21:15:50Z

0

You can probably find the median faster, but you have to make sure it's worth it.

First, notice you need to find O(n^2) medians - of a_i....a_j of all j>i . Your method of calculating the median every time has a time complexity of O(N^3).

You can improve that once you figure out you don't need O(N) of time to find the median of a_i....a_j if you already know the median of a_i....a_j-1.

Turns out that by using Orderd Statistics Trees you can find the median in O(log N). So, to find the median of a_1...a_j *for all 1

Start building the tree with a_1. Find the median. Add a_2. Find the median. Add a_3. Find the median, and so on until you add a_n. And find the median.

Each operation here takes O(logN) and you have 2*N operations, so you can find all the medians of a_1...a_j in O(NlogN), as compared to O(N^2) in your original suggestion.

Do that for all starting points (start with a_2, then a_3 and so on), and you can find all O(N^2) medians in O(N^2 * log N)

Now, is O(N^2 * log N), plus O(N^2) of memory (to remember all the medians) better than what you already have? That depends on m. If m>>N, it might be worth it.

answered Aug 21, 2015 at 21:15

zmbq

39.1k15 gold badges109 silver badges189 bronze badges

2 Comments

IVlad Over a year ago

Unfortunately, with n up to 10^6, O(n^2) memory is definitely not feasible.

zmbq Over a year ago

It is if you have a big disk.

Eamonn O'Brien-Strain · Accepted Answer · 2015-08-22 06:05:04Z

0

A practical quick-and-dirty way of doing this is something like:

int median(const std::vector<int>& a, int i, int j) {
  std::vector<int> subsequence(a.begin() + i, a.begin() + j);
  std::sort(subsequence.begin(), subsequence.end());
  return subsequence[subsequence.size() / 2];
}

i.e. copy the subsequence, sort it, and return the centre element.

answered Aug 22, 2015 at 6:05

Eamonn O'Brien-Strain

3,4121 gold badge26 silver badges33 bronze badges

Comments

dspyz · Accepted Answer · 2016-03-16 21:08:18Z

Perform a mergesort and create a binary tree where each layer is a stage of the mergesort as a collection of sorted arrays.

eg, suppose our range we want to make queries over is: [3, 2, 5, 5, 9, 6, 10, 1]

Then our tree would look like:

[1,  2,  3,  5,  5,  6,  9, 10]
       /               \
[2,  3,  5,  5] [1,  6,  9, 10]
   /       \       /       \
[2,  3] [5,  5] [6,  9] [1, 10]
  / \     / \     / \     / \
[3] [2] [5] [5] [6] [9][10] [1]

This requires O(NlogN) space to store and takes O(NlogN) time to generate (just mergesort the elements and store the result of each stage of the sort). Additionally, we want to store the endpoints of the segments represented by each internal node (this can also be computed on the fly by looking at the length of each array, but it's easier to explain if we just imagine that we store these values in the tree together with the arrays).

                     (1,8)
                 /           \
         (1,4)                   (5,8)
       /       \               /       \
   (1,2)       (3,4)       (5,6)       (7,8)
   /   \       /   \       /   \       /   \
(1,1) (2,2) (3,3) (4,4) (5,5) (6,6) (7,7) (8,8)

Now we can ask for any two ranges:

Does s cover t (x_s <= x_t and y_s >= y_t) b)
Does s intersect t (x_s <= y_t and y_s >= x_t)

Now, suppose we're given the bounds b=(3,7) and we want to find a minimal set of nodes which cover this range.

In other words, we want to obtain: (3,4) (5,6) (7,7)

Note this list of bounds will always have O(logN) length since for any number, P, there can be at most two items which have length P and all the lengths are powers of 2.

To find this set in O(logN) time, we start from the root node and traverse the tree at each node asking:

Does b overlap this range? If yes, include the range in the list and stop.
Does b intersect this range? If yes, recurse to each of its children.
Otherwise, stop and don't add anything to the list.

Once we've obtained the arrays corresponding to these ranges (in the 3,7 example, the corresponding arrays are [5,5],[6,9],[10]) we can now answer the following question in O(log^2N) time by binary searching each one (there are O(logN) of these sorted arrays and they each have length O(N) in the worst case) and then summing up the indices (quick note; this is the only case I've ever seen where it makes sense to add indices into different arrays together to get a sum):

Given some x, how many elements among all the arrays are less than x?

We binary search each array for x, and the resulting index is the number of elements less than x.

Now, to find the median of some range, (s, t), we are asking:

What is the element prior to the smallest element x with the property that (t-s+1)/2 elements in between indices s and t are less than x?

To answer this, we can binary search the original sorted list. Since each lookup in our binary search requires O(log^2N) time, the total time to answer the query is O(log^3N)

Here's the pseudocode:

fn create_tree(elems, lb, ub):
  if len(elems) == 1:
     return leaf((lb, ub), elems)
  mid = midpt(lb, ub)
  (left, right) = splitInHalf(elems)
  tleft = create_tree(left, lb, midpt-1)
  tright = create_tree(right, midpt, ub)
  return internal(merge(tleft.elems, tright.elems), lb, ub, tleft, tright)

fn get_cover_arrays(tree,bnds):
  if covers(bnds,tree.bnds):
    return [tree.elems]
  else if intersects(bnds,tree.bnds):
    return get_cover_arrays(tree.left,bnds) `concat` get_cover_arrays(tree.right,bnds)
  else:
    return []

fn numElemsLT(cover_arrays, x):
  return sum(idx=binary_search(arr,x) for arr in cover_arrays)

fn getKthInRange(tree, bnds, k):
  cover_arrays = get_cover_arrays(tree, bnds)
  all_elems = tree.elems
  next_idx = binary_search(lambda i: numElemsLT(cover_arrays, all_elems[i]), k)
  return all_elems[next_idx - 1]

fn getMedian(tree, (lb,ub)):
  return getKthInRange(tree, (lb,ub), (ub-lb+1) / 2)

evil999man · Accepted Answer · 2015-08-22 02:48:23Z

-1

This problem can be solved by persistent segment trees. Basically, build a frequency array for each segment (1,r) and then you can use range sum thingys to go either left or right in your seg tree.

It seems like you need O(N^2) space, but you can see that tree of (1,x) and (1,x+1) have only O(lg n) different nodes. So you can build a persistent structure Memory : O(N lg n) Time : O(N lg N) preprocessing, O(lg N) per query

See problem MKTHNUM

edited Aug 22, 2015 at 2:48

answered Aug 22, 2015 at 2:37

evil999man

2121 silver badge10 bronze badges

Collectives™ on Stack Overflow

Algorithm - many queries about median in sequence

5 Answers 5

2 Comments

2 Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

2 Comments

2 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related