2

We have a sequence a1, a2, ..., an where ai are integer, |ai| <= 106 and n <= 106. We have m queries in the form: "i j" which mean "What is the median in sequence ai, ai+1, ..., aj?"

Do you have some idea how to do it? I know there is a algorithm to find median in sequence in linear time (https://en.wikipedia.org/wiki/Median_of_medians), but applying it in every query is too slow.

5
  • Hard to imagine that you could get a median in faster than linear time. Commented Aug 21, 2015 at 21:02
  • When your data is so large, would the median be very far away from the mean? Commented Aug 21, 2015 at 21:03
  • It's easy to imagine, since there's a lot of overlaps. If you have a_i....a_j and already have their median, does it still take linear time to find the median of a_i....a_j+1 ? or a_i+1....a_j ? Commented Aug 21, 2015 at 21:04
  • I can't help but think of a modified Fenwick Tree that stores medians (and the count of numbers involved) instead of prefixes Commented Aug 21, 2015 at 21:11
  • Do you need to compute the median of every subsequence? Or just particular subsequences? Can you characterize which subsequences you need to find the median for? Commented Aug 21, 2015 at 21:11

5 Answers 5

4

The problem is called Range Median Query. There are algorithms with varying complexities and properties, see this and this as starting points.

From Mark Gordon's answer on Quora:

Create a two dimensional orthogonal range tree created from the N points of the form (A[i], i). Constructing this tree can be done easily in O(N log^2 N) time (although O(N log N) is possible).

Now to query the kth element we traverse the first dimension of the tree. We follow the left subtree if the number of points within our query index range in the left subtree is smaller than k. This is simply a query on the second dimension tree of the left subtree. If the kth element isn't in the left subtree we adjust k appropriately and search in the right subtree. This whole search takes O(N log^2 N) time. Essentially we have dropped a log N factor from Johnny's solution by wrapping the binary search into the traversal of the tree.

It's actually possible to get this down to O(N log N) preprocessing and O(log N) per query. Skip to about 17:00 in 6.851: Advanced Data Structures (Spring'12) to see Erik Demaine explain orthogonal range trees and how to achieve the faster preprocessing and query times which take mild cleverness and fractional cascading respectively.

There are also a few research papers dedicated to the subjects if you'll search for the problem name. It's not an easy problem, and you'll probably need to do a bit of documentation to grasp the solutions. I'd start by watching the video linked in the Quora answer I quoted.

Unfortunately, I don't understand the subject matter well enough to explain it very well myself in this format. If someone does, feel free to edit this or post your own answer and I'll remove mine.

Sign up to request clarification or add additional context in comments.

2 Comments

(P.S. don't implement fractional cascading unless you're in it for the educational experience.)
It's actually O(log^3(N)) per query because the question here is the inverse of what's described in the video (you can imagine you're given the x-values and you want to pick y-values for the box so as to get a desired number of points), thus we need to binary search the solution where each function call is itself O(log^2N) (2 dimensions, x and y)
0

You can probably find the median faster, but you have to make sure it's worth it.

First, notice you need to find O(n^2) medians - of a_i....a_j of all j>i . Your method of calculating the median every time has a time complexity of O(N^3).

You can improve that once you figure out you don't need O(N) of time to find the median of a_i....a_j if you already know the median of a_i....a_j-1.

Turns out that by using Orderd Statistics Trees you can find the median in O(log N). So, to find the median of a_1...a_j *for all 1

Start building the tree with a_1. Find the median. Add a_2. Find the median. Add a_3. Find the median, and so on until you add a_n. And find the median.

Each operation here takes O(logN) and you have 2*N operations, so you can find all the medians of a_1...a_j in O(NlogN), as compared to O(N^2) in your original suggestion.

Do that for all starting points (start with a_2, then a_3 and so on), and you can find all O(N^2) medians in O(N^2 * log N)

Now, is O(N^2 * log N), plus O(N^2) of memory (to remember all the medians) better than what you already have? That depends on m. If m>>N, it might be worth it.

2 Comments

Unfortunately, with n up to 10^6, O(n^2) memory is definitely not feasible.
It is if you have a big disk.
0

A practical quick-and-dirty way of doing this is something like:

int median(const std::vector<int>& a, int i, int j) {
  std::vector<int> subsequence(a.begin() + i, a.begin() + j);
  std::sort(subsequence.begin(), subsequence.end());
  return subsequence[subsequence.size() / 2];
}

i.e. copy the subsequence, sort it, and return the centre element.

Comments

0

Perform a mergesort and create a binary tree where each layer is a stage of the mergesort as a collection of sorted arrays.

eg, suppose our range we want to make queries over is: [3, 2, 5, 5, 9, 6, 10, 1]

Then our tree would look like:

[1,  2,  3,  5,  5,  6,  9, 10]
       /               \
[2,  3,  5,  5] [1,  6,  9, 10]
   /       \       /       \
[2,  3] [5,  5] [6,  9] [1, 10]
  / \     / \     / \     / \
[3] [2] [5] [5] [6] [9][10] [1]

This requires O(NlogN) space to store and takes O(NlogN) time to generate (just mergesort the elements and store the result of each stage of the sort). Additionally, we want to store the endpoints of the segments represented by each internal node (this can also be computed on the fly by looking at the length of each array, but it's easier to explain if we just imagine that we store these values in the tree together with the arrays).

                     (1,8)
                 /           \
         (1,4)                   (5,8)
       /       \               /       \
   (1,2)       (3,4)       (5,6)       (7,8)
   /   \       /   \       /   \       /   \
(1,1) (2,2) (3,3) (4,4) (5,5) (6,6) (7,7) (8,8)

Now we can ask for any two ranges:

  1. Does s cover t (x_s <= x_t and y_s >= y_t) b)
  2. Does s intersect t (x_s <= y_t and y_s >= x_t)

Now, suppose we're given the bounds b=(3,7) and we want to find a minimal set of nodes which cover this range.

In other words, we want to obtain: (3,4) (5,6) (7,7)

Note this list of bounds will always have O(logN) length since for any number, P, there can be at most two items which have length P and all the lengths are powers of 2.

To find this set in O(logN) time, we start from the root node and traverse the tree at each node asking:

  1. Does b overlap this range? If yes, include the range in the list and stop.
  2. Does b intersect this range? If yes, recurse to each of its children.
  3. Otherwise, stop and don't add anything to the list.

Once we've obtained the arrays corresponding to these ranges (in the 3,7 example, the corresponding arrays are [5,5],[6,9],[10]) we can now answer the following question in O(log^2N) time by binary searching each one (there are O(logN) of these sorted arrays and they each have length O(N) in the worst case) and then summing up the indices (quick note; this is the only case I've ever seen where it makes sense to add indices into different arrays together to get a sum):

Given some x, how many elements among all the arrays are less than x?

We binary search each array for x, and the resulting index is the number of elements less than x.

Now, to find the median of some range, (s, t), we are asking:

What is the element prior to the smallest element x with the property that (t-s+1)/2 elements in between indices s and t are less than x?

To answer this, we can binary search the original sorted list. Since each lookup in our binary search requires O(log^2N) time, the total time to answer the query is O(log^3N)

Here's the pseudocode:

fn create_tree(elems, lb, ub):
  if len(elems) == 1:
     return leaf((lb, ub), elems)
  mid = midpt(lb, ub)
  (left, right) = splitInHalf(elems)
  tleft = create_tree(left, lb, midpt-1)
  tright = create_tree(right, midpt, ub)
  return internal(merge(tleft.elems, tright.elems), lb, ub, tleft, tright)

fn get_cover_arrays(tree,bnds):
  if covers(bnds,tree.bnds):
    return [tree.elems]
  else if intersects(bnds,tree.bnds):
    return get_cover_arrays(tree.left,bnds) `concat` get_cover_arrays(tree.right,bnds)
  else:
    return []

fn numElemsLT(cover_arrays, x):
  return sum(idx=binary_search(arr,x) for arr in cover_arrays)

fn getKthInRange(tree, bnds, k):
  cover_arrays = get_cover_arrays(tree, bnds)
  all_elems = tree.elems
  next_idx = binary_search(lambda i: numElemsLT(cover_arrays, all_elems[i]), k)
  return all_elems[next_idx - 1]

fn getMedian(tree, (lb,ub)):
  return getKthInRange(tree, (lb,ub), (ub-lb+1) / 2)

Comments

-1

This problem can be solved by persistent segment trees. Basically, build a frequency array for each segment (1,r) and then you can use range sum thingys to go either left or right in your seg tree.

It seems like you need O(N^2) space, but you can see that tree of (1,x) and (1,x+1) have only O(lg n) different nodes. So you can build a persistent structure Memory : O(N lg n) Time : O(N lg N) preprocessing, O(lg N) per query

See problem MKTHNUM

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.