Skip to content

Conversation

@SasSwart
Copy link
Contributor

@SasSwart SasSwart commented Oct 2, 2025

Relates to coder/internal#934

This PR provides a mechanism to filter provisioner jobs according to who initiated the job.
This will be used to find pending prebuild jobs when prebuilds have overwhelmed the provisioner job queue. They can then be canceled.

If prebuilds are overwhelming provisioners, the following steps will be taken:

# pause prebuild reconciliation to limit provisioner queue pollution:
coder prebuilds pause 
# cancel pending provisioner jobs to clear the queue
coder provisioner jobs list --initiator="prebuilds" --status="pending" | jq ... | xargs -n1 -I{} coder provisioner jobs cancel {}
# push a fixed template and wait for the import to complete
coder templates push ... # push a fixed template
# resume prebuild reconciliation
coder prebuilds resume

This interface differs somewhat from what was specified in the issue, but still provides a mechanism that addresses the issue. The original proposal was made by myself and this simpler implementation makes sense. I might add a --search parameter in a follow-up if there is appetite for it.

Potential follow ups:

  • Support for this usage: coder provisioner jobs list --search "initiator:prebuilds status:pending"
  • Adding the same parameters to coder provisioner jobs cancel as a convenience feature so that operators don't have to pipe through jq and xargs

@SasSwart SasSwart changed the title Jjs/internal 934 feat: add filtering by initiator to provisioner job listing in the CLI Oct 2, 2025
t.Run("Cancel", func(t *testing.T) {
t.Parallel()

db, ps := dbtestutil.NewDB(t)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this inside the subtest to properly isolate these tests from the list subcommand tests

}
})

t.Run("List", func(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were no tests for the list subcommand. I've added some basic test cases as a scouting effort, but I focus on the tests to prove my own work right now.

@SasSwart SasSwart marked this pull request as ready for review October 2, 2025 09:30
@SasSwart SasSwart requested a review from mafredri October 2, 2025 09:30
@SasSwart SasSwart requested a review from ssncferreira October 2, 2025 09:46
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and thanks for expanding test coverage.

I don't see anything blocking, so approving, but had a few comments inline.


var initiatorID *uuid.UUID

if initiatorIDStr != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be ID or name, right? Suggest renaming it to be more generic.

// @Param ids query []string false "Filter results by job IDs" format(uuid)
// @Param status query codersdk.ProvisionerJobStatus false "Filter results by status" enums(pending,running,succeeded,canceling,canceled,failed)
// @Param tags query object false "Provisioner tags to filter by (JSON of the form {'tag1':'value1','tag2':'value2'})"
// @Param initiator_id query string false "Filter results by initiator ID" format(uuid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to enforcing uuid here, but typically we allow for both name and UUID for users (which this kind of is), and usually even me. Food for thought before we introduce this as part of stable API.

@SasSwart
Copy link
Contributor Author

SasSwart commented Oct 2, 2025

@mafredri do you have opinions for or against the coder provisioner jobs list --search "initiator:prebuilds status:pending" syntax? I'd like to know whether I should work on a follow up for it or whether that can wait.

Same question applies to whether or not I should add bulk deletion using the --initiator flag to coder provisioner jobs cancel. Do you see value in this?

@mafredri
Copy link
Member

mafredri commented Oct 2, 2025

@mafredri do you have opinions for or against the coder provisioner jobs list --search "initiator:prebuilds status:pending" syntax? I'd like to know whether I should work on a follow up for it or whether that can wait.

I don't mind adding --search, I like to have feature parity with UI and also a unified interface across commands (e.g. coder list).

If we want to keep --initiator and --status, though, we need to either deprecate them or change the documentation to say alias for "--search status:[value]", etc. The --search query is less discoverable so I somewhat prefer the alias route.

Same question applies to whether or not I should add bulk deletion using the --initiator flag to coder provisioner jobs cancel. Do you see value in this?

I'm not sold it's needed on cancel as canceling everything on such a granularity seems potentially dangerous. By listing you are encouraged to vet the list before proceeding.

I think a step 1 could be to just allow passing multiple IDs to cancel.

# via args
coder provisioner jobs cancel $(coder provisioner jobs list --initiator prebuild --quiet)

# via stdin
coder provisioner jobs list --initiator prebuild --quiet | coder provisioner jobs cancel --stdin

See coder exp task list --quiet for motivation for the quiet flag (prints only IDs, same behavior as docker commands).

Copy link
Contributor

@ssncferreira ssncferreira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good! I just left a few questions in the inline comments. A couple of additional thoughts:

  • In the PR description, you mention a --status="pending" flag, but I see the CLI already has a status option:
     -s, --status [offline|idle|busy], $CODER_PROVISIONER_LIST_STATUS
            Filter by provisioner status.
    
    Could you update the description? (edit: was looking at the coder provisioner list CLI command by mistake)
  • In the description you also note:
    Adding the same parameters to coder provisioner jobs cancel as a convenience feature so that operators don't have to pipe through jq and xargs
    
    I wonder if it might be more reliable to pass an explicit list of job IDs to the cancel command. That way, you know exactly what’s being canceled and avoid the risk of unintentionally including other jobs.
  • Not necessarly for this PR, but down the line it could be really useful to support search queries for template (and possibly version, defaulting to active). The main use case I see for this command would be when a faulty template is published, and users want to quickly delete all the related prebuild jobs. That could be a nice follow-up.


var initiatorID *uuid.UUID

if initiatorIDStr != "" {
Copy link
Contributor

@ssncferreira ssncferreira Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m curious, what made you go with initiator instead of owner here? If I’m remembering correctly, prebuild claims also use the prebuild system user as the initiator (link), which makes me wonder if this might unintentionally filter prebuild claim related jobs.

Also, for consistency with other parts of Coder (e.g., coder list --search and the UI both use owner), I wonder if using a search flag and owner might be a clearer fit. What do you think?

Copy link
Contributor Author

@SasSwart SasSwart Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Owner doesn't fit the data model, because jobs can either be workspace builds or template imports. Workspace builds have an owner, but template versions (for import jobs) do not.

Therefore the owner label is only applicable to some jobs and that confuses the interface.

I initially wanted to include the owner label but it was the clearer, simpler solution to go with initiator. Furthermore, the searchquery package shows that each searchable resource defines its own list of searchable attributes. Therefore I don't think using initiator is a problem.

We can add owner in a separate PR if you'd like. But it would come with the implicit filter that only workspace build jobs are returned. Unless we want to assume that created_by is the same thing as owner, but that makes me uncomfortable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that is a good point about the import job. Let's keep it as initiator in that case. Maybe it would make sense to add this reasoning in the PR description?

job := codersdk.ProvisionerJob{
ID: provisionerJob.ID,
OrganizationID: provisionerJob.OrganizationID,
InitiatorID: provisionerJob.InitiatorID,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a test around claiming prebuilds to ensure that InitiatorID isn’t the prebuilds user. Otherwise, there’s a risk we could inadvertently cancel prebuild-claiming jobs.

Copy link
Contributor Author

@SasSwart SasSwart Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just fix that bug (which I'm happy to do in a separate PR) and rather have claims be initiated by the user that claimed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not 100% sure this is a bug, I think it’s more a matter of perspective. It’s true that the user action initiates the claim process, but internally it’s the prebuilds user who actually performs the claim.
I’m okay with changing the initiator to be the user who claims the prebuild, since I still think that makes the most sense, I’m just not 100% sure if there might be any problems that come with making that change.

In either case, I still think we should have a test for this scenario to cover that specific case.

@SasSwart
Copy link
Contributor Author

SasSwart commented Oct 2, 2025

@ssncferreira where did you see -s, --status [offline|idle|busy], $CODER_PROVISIONER_LIST_STATUS?

I just ran the command using --status="pending" and it works fine.

See attached:
image

@ssncferreira
Copy link
Contributor

@ssncferreira where did you see -s, --status [offline|idle|busy], $CODER_PROVISIONER_LIST_STATUS?

I just ran the command using --status="pending" and it works fine.

See attached: image

You’re totally right, I was looking at the coder provisioner list CLI command instead 🤦‍♀️ That’s what I get for multitasking 😅, sorry about the confusion!

@SasSwart SasSwart enabled auto-merge (squash) October 6, 2025 08:46
@SasSwart SasSwart merged commit d17dd5d into main Oct 6, 2025
28 of 30 checks passed
@SasSwart SasSwart deleted the jjs/internal-934 branch October 6, 2025 08:56
@github-actions github-actions bot locked and limited conversation to collaborators Oct 6, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants