
Conversation

@fioan89 (Collaborator) commented Nov 12, 2025

Netflix uses a custom MFA flow that requires CLI middleware to handle auth. Their custom CLI implementation intercepts 403 responses from the REST API, handles the MFA challenge, and retries the REST call. The MFA challenge is handled only by the `start` and `ssh` actions; the remaining actions can go directly to the REST endpoints because the custom header command provides MFA tokens to the HTTP calls.

Both Gateway and the VS Code extension delegate the start logic to the CLI, but Toolbox does not, which caused issues for the customer. This PR ports some of that work from Gateway into Coder Toolbox.

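For context, here is a minimal sketch of what delegating the `start` action to the CLI can look like, so the CLI middleware gets a chance to intercept 403s and run the MFA challenge. The binary path, the `--yes` flag, and the function name are assumptions for illustration, not the exact code in this PR:

```kotlin
import java.util.concurrent.TimeUnit

// Hypothetical sketch: run the CLI's start subcommand instead of calling the
// REST API directly, so the CLI middleware can handle the MFA challenge and
// retry the REST call on its own.
fun startWorkspaceViaCli(cliPath: String, workspaceName: String): Int {
    val process = ProcessBuilder(cliPath, "start", "--yes", workspaceName)
        .redirectErrorStream(true) // keep MFA prompts and errors on one stream
        .start()
    // Stream the CLI output so the user can follow the MFA challenge.
    process.inputStream.bufferedReader().forEachLine { println(it) }
    return if (process.waitFor(10, TimeUnit.MINUTES)) {
        process.exitValue()
    } else {
        process.destroy()
        -1 // timed out waiting for the CLI
    }
}
```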
Comment on lines +134 to +142
// The CLI takes ~15 seconds to move the workspace into the queueing/starting state,
// during which the user won't see anything happening in Toolbox after Start is clicked.
// We work around this by forcing a `Queued` state for those 15 seconds.
while (startJob.isActive && client.workspace(workspace.id).latestBuild.status.isNotStarted()) {
    state.update {
        WorkspaceAndAgentStatus.QUEUED.toRemoteEnvironmentState(context)
    }
    delay(1.seconds)
}
Member

Thinking out loud: could this incorrectly set the state to queued? Suppose the workspace goes to canceling; depending on the timing, we could mistakenly update it to queued.

Probably not a huge deal since it would eventually correct itself and would be brief at worst, but one thought: we could check whether the latest build has changed, which would indicate a new build has actually started. It also occurs to me that we probably only need to update to queued once.

// ...
// During those 15 seconds ...
state.update { WorkspaceAndAgentStatus.QUEUED.toRemoteEnvironmentState(context) }
// wait for the build to actually start since there is a dry run first
// (lastBuild is the build captured before the start command was dispatched)
while (startJob.isActive && client.workspace(workspace.id).latestBuild == lastBuild) {
  delay(1.seconds)
}

Or maybe a more general fix would be for the updater itself to only update the status if the build changed, and otherwise preserve the current state, so we could set the state once manually and have it persist.

Collaborator Author

The problem with that is that the general polling loop runs every 5 seconds, and for those 15 seconds the workspace keeps the same status (in most cases it is stopped). We'd have to stop the general polling loop, otherwise the workspaces for which we dispatched the start command would go from Stopped to Queued and then back to Stopped at the next general poll.
But if we pause the general poll loop, we miss updates on the rest of the workspaces. I'm still pondering the best approach...

Member

This is about my general fix idea, right? I was thinking the build check would prevent going back to stopped. So something like:

  1. General poll runs, gets stopped build and sets status to stopped.
  2. User starts workspace and we set the status to queued.
  3. General poll runs, gets the same stopped build again, and skips setting the status because it is the same build it saw previously.
  4. General poll runs twice more, preserving the current status each time.
  5. The workspace finally starts after 15 seconds, which creates a new started build.
  6. General poll runs, gets started build, updates status to started because it is a new build.

But I don't know if this is any better than having the startIsInProgress gate; it's mostly just a thought. I suppose it would also let us do things like immediately set a workspace to stopping when stopping it, rather than waiting for the request round trip to update the status, which could be nice.
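For illustration, a rough sketch of that build-gated updater idea (names like `BuildGatedStatusUpdater`, `onPoll`, and `setManually` are hypothetical, not the plugin's actual API): the poller only trusts the REST status when it sees a new build, and otherwise preserves whatever state was set manually.

```kotlin
// Hypothetical sketch of an updater that only applies the REST-derived status
// when the latest build has changed, so a manually set state persists.
class BuildGatedStatusUpdater(initialState: String) {
    private var lastKnownBuildId: String? = null
    private var currentState: String = initialState

    // Called by the general poll loop every ~5 seconds.
    fun onPoll(latestBuildId: String, statusFromApi: String): String {
        if (latestBuildId != lastKnownBuildId) {
            // A new build appeared, so the REST status is authoritative again.
            lastKnownBuildId = latestBuildId
            currentState = statusFromApi
        }
        // Same build as the last poll: preserve whatever was set manually,
        // e.g. "Queued" right after the user clicked Start.
        return currentState
    }

    // Called immediately when the user triggers an action (start/stop).
    fun setManually(state: String) {
        currentState = state
    }
}
```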

Collaborator Author

I think the existing implementation works almost exactly the way you proposed, except for the part about "gets the same stopped build again, skips setting the status because it is the same build it saw previously", which it does a bit differently. Instead of the general poll "remembering" the previous poll state, it is simply forbidden from changing the status of a workspace we know should be starting soon. A second polling loop, only for the workspace that should be starting, has the right to move the status from "queued" to "starting"; after that, the secondary poll loop dies and the primary poll loop gets its full rights back.

Now that I'm explaining it, though, it does sound like a bit of over-engineering (two polling loops).
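For comparison, a rough sketch of the gate described above, with illustrative names rather than the plugin's real identifiers: the primary poll loop checks the gate before overwriting a status, and the short-lived secondary loop owns the workspace until its build actually starts.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of the start gate shared by the two polling loops.
object StartGate {
    private val starting: MutableSet<String> = ConcurrentHashMap.newKeySet()

    // Secondary loop registers the workspace it now owns.
    fun markStarting(workspaceId: String) { starting.add(workspaceId) }

    // Secondary loop releases ownership once the build has actually started.
    fun clearStarting(workspaceId: String) { starting.remove(workspaceId) }

    // Primary poll loop checks this before overwriting a workspace's status.
    fun primaryLoopMayUpdate(workspaceId: String): Boolean = workspaceId !in starting
}
```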

@code-asher (Member) commented Nov 21, 2025

Yup yup, similar but only for this specific starting flow, whereas my proposal works for any state change that might take a while, plus it only needs the one loop.

@fioan89 merged commit e24f564 into main on Nov 20, 2025
6 checks passed
@fioan89 deleted the impl-start-workspace-via-cli branch on November 20, 2025 at 22:10