
I'm working on an interesting question about answering range h-index queries.

FYI: the h-index is the maximum integer h such that the author has published at least h papers that have each been cited at least h times.

You are given N papers and the i-th paper has been cited p[i] times. There will be Q questions, each giving a range [l, r] and asks what would be the h-index of an author that only publishes the papers from index l to r inclusive. N, Q, and all p[i] are between 1 and 2*10^5.

Here is an example input:

4 2
5 7 4 1
2 3
3 4

The output should be

2
1

Here p = [5, 7, 4, 1]. A query on range [2, 3] should give 2 because the 2 papers in that range have each been cited at least twice (7 and 4 times, specifically). A query on [3, 4] should give 1.

Binary searching on the subarray of each query gives O(QNlogN). The code below is my attempt at solving the problem. It gives the right answers, but it exceeds the 3 second time limit. The time limit covers the execution of the entire program, so this includes the time to read input. I also tried scanf and cin.tie(0);ios::sync_with_stdio(0); to speed up I/O, but that still exceeds the time limit.

#include <iostream>
int p[200005];
int main() {
    using namespace std;
    int n;
    cin >> n;
    int q;
    cin >> q;
    for (int i = 0; i < n; i++) {
        cin >> p[i + 1];
    }
    for (int i = 0; i < q; i++) {
        int l;
        cin >> l;
        int r;
        cin >> r;
        // binary search for the largest h such that at least h papers in [l, r] are cited >= h times
        int lo = 1, hi = n;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            int cnt = 0;
            for (int j = l; j <= r; j++) {
                if (p[j] >= mid) {
                    cnt++;
                }
            }
            if (cnt >= mid) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        cout << hi << '\n'; // hi ends at the largest value that passed the check
    }
}

My other idea was a segment tree, but I don't know how to merge two segment tree nodes faster than linear time. The problem is still mainly the time complexity: I'm looking for a faster algorithm, and based on the constraints I believe the complexity per query should be better than O(N).

I'm stumped, what is a faster way to solve this within the time limit?

  • Aside: if you're using C++, why are you rolling your own binary search rather than just using std::binary_search from <algorithm>? std::vector is also likely a better option than your C-style array as a global. Commented Aug 11 at 19:59
  • @Chris I don't see how std::binary_search would help me find the h-index given it returns true or false? I'm binary searching on what the h-index is, I'm not looking for an element in a range Commented Aug 11 at 20:11
  • binary_search has a sibling lower_bound that returns a position, instead of a yes/no. Commented Aug 11 at 20:26
  • @BoP I'd considered that, but I didn't see how to search over a range cleanly without an actual container of all the elements in that range. In my case, I'm looking for the last integer for which a particular check (is the h-index at least mid?) returns true, without actually evaluating the check for every integer in the range (which would make it much slower). Commented Aug 11 at 20:34
  • A link to the challenge would be helpful. Commented Aug 11 at 21:11

2 Answers


Mo's algorithm can solve this problem offline in O(QlogQ + N√Q). This time complexity is not the best, but the advantage of this approach lies in the simplicity of its implementation.

First, split the array into √Q blocks of size N/√Q. Then sort all the queries in increasing order of the block number of their left endpoint, breaking ties in increasing order of the right endpoint (for queries whose left indices fall in the same block).

Instead of merging answers for multiple ranges, we can maintain a current subarray and its answer, which starts out empty. While iterating over the queries, we just need to transform the subarray from the previous query into the subarray for the next query by adding and removing elements from the ends.

In order to efficiently maintain the answer with each addition or deletion of an element, we keep track of how many papers there are with each citation count as well as how many papers have been cited more times than the current h-index in the current subarray. To add an element, notice that the h-index will either stay the same or increase by 1 if the number of papers that have been cited more than the current h-index is greater than the current h-index. Removing an element is similar: the answer decreases by at most 1. Thus, adding or removing a single element can be done in O(1).

To understand the time complexity, let's now consider how often we move the left and right index of the current subarray. The left index of adjacent queries after sorting differ by at most the block size, so the left index moves at most O(Q * N/√Q) = O(N√Q) times in total. For each block, the queries are sorted in increasing order of right index, so the right index moves O(N) times for each block, which is O(N * √Q) in total. This shows that we require O(N√Q) addition and deletion operations; since each operation is done in constant time, the time complexity for answering the queries after O(QlogQ) sorting is O(N√Q). Finally, we only need to output the answers in the correct order based on the original index of each query.

#include <iostream>
#include <vector>
#include <algorithm>
#include <utility>
#include <cmath>
int main() {
    struct Query {
        int index, left, right;
    };
    int N, Q;
    std::cin >> N >> Q;
    // just for demonstration; it's often better to use a fixed constant block size
    const int blockSize = std::max(N / std::sqrt(Q), 1.);
    std::vector<int> citations(1); // make it 1-indexed
    for (int i = 0, c; i < N; ++i) {
        std::cin >> c;
        citations.push_back(std::min(c, N));
    }
    std::vector<Query> queries;
    std::vector<int> answers(Q), count(N + 1);
    for (int i = 0, left, right; i < Q; ++i) {
        std::cin >> left >> right;
        queries.emplace_back(i, left, right);
    }
    std::ranges::sort(queries, {}, [&](const auto& query) {
        return std::pair(query.left / blockSize, query.right);
    });
    int hIndex = 0, greater = 0;
    auto addPaper = [&](int index) {
        ++count[citations[index]];
        greater += citations[index] > hIndex;
        if (greater > hIndex) {
            ++hIndex;
            greater -= count[hIndex];
        }
    };
    auto removePaper = [&](int index) {
        --count[citations[index]];
        greater -= citations[index] > hIndex;
        if (count[hIndex] + greater < hIndex) {
            greater += count[hIndex];
            --hIndex;
        }
    };
    int currLeft = 1, currRight = 0;
    for (const auto& query : queries) {
        while (currRight < query.right) addPaper(++currRight);
        while (currRight > query.right) removePaper(currRight--);
        while (currLeft > query.left) addPaper(--currLeft);
        while (currLeft < query.left) removePaper(currLeft++);
        answers[query.index] = hIndex;
    }
    for (int answer : answers) std::cout << answer << '\n';
}

Alternatively, this problem can be solved in O((N + Q)logN) with a persistent segment tree (used in a similar way as in my answer to Efficient way to find sum of largest x elements in a subarray). First, replace all citation counts greater than N with N as the h-index cannot be greater than the number of papers. Then, we can build a persistent segment tree over the range 1 to N in O(NlogN), where each node stores the frequency of papers in a certain prefix of the array with a citation count that falls within its segment. Each query on a subarray [l, r] can be answered in O(logN) by walking down the segment tree starting from the roots for the prefix at r and l - 1.


1 Comment

ty!! is persistent segment tree also offline and do you have more details on writing the code for it?

I can do it in O(log² N) per query. I keep thinking there must be a way to do O(log N) per query, but I haven't figured it out yet.

Here's the idea: build a segment tree where each node holds the sorted list of values that appear in its interval, where each such value v is accompanied by

  • the total frequency of values in the interval that are greater than or equal to v;
  • for a non-leaf node, the index (in the left child's sorted list) of the smallest value in the left child that is greater than or equal to v, and the same for the right child.

This is a data structure of total size O(N log N).

Now, given a query interval we can find its h-index using binary search: for each potential value h, we will check if there are at least h values in the interval that are at least h, and depending on the answer to that subproblem we'll discard half of the possible values of the h-index. To find the answer to this subproblem, we look up h in the root list and walk down the tree to the appropriate nodes as in any other segment tree query. When we descend from a node to its child, we use the index values to quickly jump from the smallest value that is greater than or equal to h in the parent node to the smallest value that is greater than or equal to h in the child node. (This is an example of fractional cascading.) Summing the corresponding frequencies across the basic intervals whose union is the query interval gives the number of values in the query interval that are at least h.

The binary search for each query has O(log N) iterations, and each iteration takes O(log N) time because it involves walking down the tree to the basic intervals, which visits at most two nodes per level of the tree.

2 Comments

ty!! did you come up with this data structure or are there any resources about it?
The Wikipedia article on fractional cascading describes the general case where the lists of values form a graph, of which a tree is a special case.
