how I can further improve this code?
We see a lot of function calls there.
And since the input \$nums\$ is not
guaranteed to be sorted, you cannot
possibly accomplish the task in less
than \$O(n)\$ linear time.
Maintain a min-heap (a priority queue) of size \$k\$.
Linearly scan over \$nums\$, inserting into
the heap as you go.
Once the heap contains \$k\$ elements,
each insertion will evict the smallest element
(possibly the very element we're attempting to insert, if it is tiny).
The heap will therefore always contain the \$k\$ largest
values encountered thus far.
Once the scan completes, just read the answer from the
head of the heap. It is the smallest value retained
by the heap, i.e. the \$k\$-th largest overall.
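Here is a minimal sketch of that idea in Go, using the standard container/heap package. The type and function names (minHeap, findKthLargest) are my own invention for illustration, not the OP's code, and it assumes \$1 \le k \le\$ len(nums).

```go
package main

import (
	"container/heap"
	"fmt"
)

// minHeap is a min-heap of ints: the smallest retained value sits at index 0.
type minHeap []int

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i] < h[j] }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(int)) }
func (h *minHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// findKthLargest retains the k largest values seen so far in a size-k min-heap.
// After the scan, the head of the heap is the k-th largest element of nums.
func findKthLargest(nums []int, k int) int {
	h := &minHeap{}
	for _, v := range nums {
		if h.Len() < k {
			heap.Push(h, v) // still filling the heap
		} else if v > (*h)[0] {
			(*h)[0] = v    // evict the smallest retained value
			heap.Fix(h, 0) // restore the heap invariant in O(log k)
		}
		// else: v is too small to be retained; one comparison and we move on
	}
	return (*h)[0]
}

func main() {
	fmt.Println(findKthLargest([]int{3, 2, 1, 5, 6, 4}, 2)) // 5
}
```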
Did this take \$O(n \log n)\$ time overall?
No.
Why is that? Because each heap insert costs \$O(\log k)\$.
And given that \$k\$ is a constant, that makes
the cost \$O(1)\$ constant as well; it is independent of the size of the input vector.
Doing that \$n\$ times gives \$O(n)\$ linear cost for the task.
(Actually, for a non-worst-case input, such
as \$nums\$ in random order, almost all insertion
attempts will be rejected by a single comparison against
the heap's minimum, letting us discard the currently
scanned item at \$O(1)\$ cost.
Even something terrible, like Insertion Sort
on a bounded-size list, would perform adequately.
The worst case can be completely avoided with an \$O(n)\$
linear-cost preprocessing step: a
Fisher-Yates
shuffle.)
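(For the record, Go's standard library already ships a Fisher-Yates shuffle as rand.Shuffle; a sketch of that preprocessing step, on a hypothetical nums slice, could be as small as this.)

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	nums := []int{3, 2, 1, 5, 6, 4}
	// rand.Shuffle performs an in-place Fisher-Yates shuffle in O(n).
	rand.Shuffle(len(nums), func(i, j int) {
		nums[i], nums[j] = nums[j], nums[i]
	})
	fmt.Println(nums) // nums in some random order
}
```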
PR process
Submissions to Code Review resemble
a Pull Request process, where a development team
tries to answer two questions about the code
before merging a feature branch into main.
- Is it correct?
- Is it maintainable?
comments lie!
From the Review Context we have this,
which could essentially be a comment in source code:
I think my time complexity is \$O(\log n)\$.
I am delighted the OP tried to assess the big-O cost.
However the problem statement makes it apparent that,
absent any specified structure or ordering,
a solution will have to inspect \$n\$ elements,
giving at least \$O(n)\$ time complexity.
It's not as though \$nums\$ is a bit vector with
at most \$m\$ \$1\$'s, which would let a counting
approach exit early, or an int vector with some known ordering.
The clear error in the proposed cost makes me
skeptical about what other errors may be in the submission,
and less interested in working out its details.
This is a sad but true aspect of reviews:
a reviewer will tend to pick out the top
few items before even getting to subtler points.
When submitting code we strive to tidy up
such items in the hopes the reviewer will then
delve deeper to offer better advice.
citation
Martin R comments:
The question asks about finding the k-th largest element with Quickselect, which has an average complexity of O(n). Can you elaborate why you recommend to use a heap-based method instead?
It is always helpful to cite a reference when implementing
an algorithm, as then we will have a Specification to
evaluate the code against.
The OP declined to do so, but let's imagine that the
Wikipedia article on Quickselect was mentioned in the source code.
This implies a design decision, also not explained in the OP.
We could choose either of the {Hoare, Lomuto} partition
schemes. The cited source broke this out as
a partition() helper, but that is absent in the OP code
so it must have been inlined, at the randomized
pivIdx assignment.
Both of the cited schemes produce a deterministic pivot point.
In my naïve reading I fail to find either scheme in
the OP, and the author has declined to offer hints
in the source text.
The code may very well be correct, but I do not
find that it is obviously correct.
Which makes it harder to assess that it is
easily maintainable by dev team members.
I suggested a decomposition into subproblems which,
when coded up by the OP in the usual way,
was likely to give a solution that other
team members could easily test and reason about.
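For instance, here is what the Lomuto scheme reads like when broken out as its own partition helper. This is my sketch of the textbook scheme, not the OP's code, and the names (lomutoPartition, store) are invented for illustration.

```go
package main

import "fmt"

// lomutoPartition partitions nums[lo..hi] around the value at pivotIdx.
// On return the pivot value sits at the returned index, everything to its
// left is <= the pivot, and everything to its right is > the pivot.
func lomutoPartition(nums []int, lo, hi, pivotIdx int) int {
	pivot := nums[pivotIdx]
	nums[pivotIdx], nums[hi] = nums[hi], nums[pivotIdx] // park the pivot at the end
	store := lo
	for i := lo; i < hi; i++ {
		if nums[i] <= pivot {
			nums[i], nums[store] = nums[store], nums[i]
			store++
		}
	}
	nums[store], nums[hi] = nums[hi], nums[store] // move the pivot into place
	return store
}

func main() {
	nums := []int{3, 2, 1, 5, 6, 4}
	p := lomutoPartition(nums, 0, len(nums)-1, 2) // pivot value 1
	fmt.Println(p, nums)                          // 0 [1 2 4 5 6 3]
}
```

A helper with that shape is easy to test in isolation, which is exactly the maintainability point.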
helper
Never literally call a helper function helper().
You have an opportunity to explain what it does,
both via the name and via
Go Doc comments.
Don't waste that opportunity.
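As a hypothetical illustration (the name, signature, and sort-based body below are mine, chosen only to keep the example complete, and are not the OP's approach), a descriptive name plus a Go Doc comment can carry most of the explanation:

```go
package main

import (
	"fmt"
	"sort"
)

// kthLargest reports the k-th largest element of nums, counting duplicates.
// It assumes 1 <= k <= len(nums) and does not modify nums.
func kthLargest(nums []int, k int) int {
	sorted := append([]int(nil), nums...) // work on a copy
	sort.Ints(sorted)
	return sorted[len(sorted)-k]
}

func main() {
	fmt.Println(kthLargest([]int{3, 2, 3, 1, 2, 4, 5, 5, 6}, 4)) // 4
}
```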
recursion
We write recursive code in order to break a hard
problem into a collection of easier subproblems.
The base case is clearly identified:
if start == end { return }
Then we have a bunch of code which is apparently trying
to preserve some invariant, while also making progress on a loop
variant, such as squishing together bigger and
one of the end points. But there's no citation,
no Go Doc comments, no comments, and I didn't
find the {bigger, smaller} names as helpful as
they could be. I didn't find it obvious that the
code is correct.
And then another(?) base case is dealt with at once,
rather than after one more recursion:
if bigger == k {
    return
}
identifiers
We have smaller, which is literally always at least
one bigger than bigger.
Hmmm, ok, didn't want to call them i and j?
I suppose one must read them as indexes
pointing at values which are smaller or bigger
than one another.
But I didn't find it obvious how that is always enforced.
Where do you see “a lot of function calls”? There is a main function which calls a helper function, which calls itself (tail recursion). Is that a bad thing?
    helper(nums, start, bigger - 1, k)
} else {
    helper(nums, bigger + 1, end, k)
}
Thank you for pointing out the
TCO,
that's helpful.
Given that bigger will sometimes get bigger,
I wasn't quite seeing the loop variant which
guarantees we have a smaller problem to solve each time.
But yeah, I suppose we could readily code
this up as a loop.
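To make that concrete, here is a hedged sketch of the loop form: the two tail calls become re-assignments of lo and hi, and a Lomuto partition around a random pivot is inlined so the example stands alone. None of these names come from the OP's code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// kthLargestIterative is randomized quickselect written as a loop: each
// iteration partitions the current window and then shrinks it toward the
// target index, so the loop variant hi-lo strictly decreases.
// It rearranges nums in place and assumes 1 <= k <= len(nums).
func kthLargestIterative(nums []int, k int) int {
	target := len(nums) - k // index the answer occupies in sorted order
	lo, hi := 0, len(nums)-1
	for lo < hi {
		// Lomuto partition around a randomly chosen pivot.
		pivotIdx := lo + rand.Intn(hi-lo+1)
		pivot := nums[pivotIdx]
		nums[pivotIdx], nums[hi] = nums[hi], nums[pivotIdx]
		store := lo
		for i := lo; i < hi; i++ {
			if nums[i] <= pivot {
				nums[i], nums[store] = nums[store], nums[i]
				store++
			}
		}
		nums[store], nums[hi] = nums[hi], nums[store]

		// Keep only the side of the window that contains the target index.
		switch {
		case store == target:
			return nums[store]
		case store < target:
			lo = store + 1
		default:
			hi = store - 1
		}
	}
	return nums[target]
}

func main() {
	fmt.Println(kthLargestIterative([]int{3, 2, 1, 5, 6, 4}, 2)) // 5
}
```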
Is recursion a bad thing?
No, certainly not, it can be a terrific way of
communicating a technical idea to another person,
of clearly demonstrating that the problem size
is continually being whittled down and at some
point we will terminate.
The trouble with the OP code is I didn't find
clear communication in it, so given the author's
current skill set I suggested an alternate
approach I felt could lead to better communication
and more maintainable code.
The review context said at least one set of inputs
made the code time out, and I was concerned we might
be making more than the minimum required number of
recursive calls.
speed
... average complexity of O(n). Can you elaborate why you recommend to use a heap-based method instead?
I proposed a method with worst-case complexity of \$O(n)\$, with invariants and loop variants that are easy to reason about.
A contest is an adversarial setting,
with the contest site trying to dream up crazy vectors
of \$10^5\$ integers which might trigger unfortunate
behavior for a proposed algorithm.
Worst case complexity of Quickselect is \$O(n^2)\$ quadratic,
though given the random pivot I don't believe the OP code
can fall into that trap.
I hope the contest organizers wouldn't mess
with rand.Intn() output.
radix sort
Here is a fresh perspective on the original problem.
The 16-bit input range is rather limited -- much more limited
than the vector size.
- Allocate \$2 \times 10^4 + 1\$ zeroed counters.
(The \$+ 1\$ accommodates the \$0\$ value,
and the factor of \$2\$ accounts for negative values.)
- Make a scan of \$n\$ input values, incrementing
the corresponding counters, with cost \$O(n)\$.
- Scan backward through the counters, summing the counts
as you go, until the running total reaches \$k\$;
the value corresponding to that counter is the answer.
That last step may seem "wasteful", in that it likely
examines a great many boring zero counts.
But it examines at most \$2 \times 10^4 + 1\$ counts, a constant,
so we're still at \$O(n)\$ time complexity overall,
at the expense of allocating a modest and
strictly bounded amount of memory.
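Here is a sketch of that counting approach, assuming the contest constraint that every value lies in \$[-10^4, 10^4]\$; the names (maxAbs, kthLargestByCounting) are mine.

```go
package main

import "fmt"

const maxAbs = 10000 // assumed value range: -10^4 .. 10^4

// kthLargestByCounting tallies occurrences of every possible value, then
// walks the counters from the largest value downward until k occurrences
// have been accounted for. It assumes 1 <= k <= len(nums).
func kthLargestByCounting(nums []int, k int) int {
	counts := make([]int, 2*maxAbs+1) // index v+maxAbs holds the count of value v
	for _, v := range nums {
		counts[v+maxAbs]++
	}
	remaining := k
	for v := maxAbs; v >= -maxAbs; v-- {
		remaining -= counts[v+maxAbs]
		if remaining <= 0 {
			return v
		}
	}
	panic("k exceeds len(nums)") // unreachable when 1 <= k <= len(nums)
}

func main() {
	fmt.Println(kthLargestByCounting([]int{3, 2, 3, 1, 2, 4, 5, 5, 6}, 4)) // 4
}
```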