-
Notifications
You must be signed in to change notification settings - Fork 4
impl: start the workspace via Coder CLI #221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Netflix uses custom MFA that requires CLI middleware to handle auth flow. The custom CLI implementation on their side intercepts 403 responses from the REST API, handles the MFA challenge, and retries the rest call again. The MFA challenge is handled only by the `start` and `ssh` actions. The remaining actions can go directly to the REST endpoints because of the custom header command that provides MFA tokens to the http calls. Both Gateway and VS Code extension delegate the start logic to the CLI, but not Toolbox which caused issues for the customer. This PR ports some of the work from Gateway in Coder Toolbox.
| // cli takes 15 seconds to move the workspace in queueing/starting state | ||
| // while the user won't see anything happening in TBX after start is clicked | ||
| // During those 15 seconds we work around by forcing a `Queuing` state | ||
| while (startJob.isActive && client.workspace(workspace.id).latestBuild.status.isNotStarted()) { | ||
| state.update { | ||
| WorkspaceAndAgentStatus.QUEUED.toRemoteEnvironmentState(context) | ||
| } | ||
| delay(1.seconds) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud, maybe it would be possible this could incorrectly set the state to queued? Like suppose the workspace goes to canceling, then we could possibly update to queued mistakenly, depending on the timing.
Probably not a huge deal, it would eventually correct itself and probably be brief at worst, but one thought is we could check to see if the latest build has changed yet which would indicate a new build finally actually started. Also occurs to me we probably only need to update to queued once.
// ...
// During those 15 seconds ...
state.update { WorkspaceAndAgentStatus.QUEUED.toRemoteEnvironmentState(context) }
// wait for the build to actually start since there is a dry run first
while (startJob.isActive && client.workspace(workspace.id).latestBuild == lastBuild) {
delay(1.seconds)
}Or, maybe a more general fix would be for the updater itself to only update the status if the build changed otherwise preserve the current state so we could set the state once manually and have it persist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with that is every 5 seconds we run the general polling loop. For 15 seconds the workspace has the same status (in most cases it is stopped). We'd have to stop the general polling loop otherwise the workspaces for which we dispatched the start command will go from Stopped to Queued and then back to Stopped at the next general poll.
But if we pause the general poll loop we miss updates on the rest of workspaces. I'm still pondering on the best approach...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about my general fix idea right? I was thinking the build check would prevent going back to stopped. So something like:
- General poll runs, gets
stoppedbuild and sets status tostopped. - User starts workspace and we set the status to
queued. - General poll runs, gets the same
stoppedbuild again, skips setting the status though because it is the same build it saw previously. - General poll runs twice more, preserves the current status each time.
- Workspace finally actually starts after 15 seconds which creates a new
startedbuild. - General poll runs, gets
startedbuild, updates status tostartedbecause it is a new build.
But idk if this is any better than having the startIsInProgress gate, mostly just a thought. I suppose it would let us do things like also immediately set a workspace to stopping when stopping a workspace rather than wait for the request round trip to update the status, which could be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the existing implementation works in almost exactly what you proposed except the part about gets the same stopped build again, skips setting the status though because it is the same build it saw previously. which does things a bit differently. Instead of the general poll "remembering" the previous poll state it is actually forbidden to change the status of a workspace we know should be starting soon. A second polling loop only for the workspace that should be starting has the right to move the status from "queued" to "starting", after which the secondary poll loop dies and the primary poll loop will get back the full rights.
Now that I'm explaining it, it sounds like a bit of over engineering (two polling loops) though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup yup, similar but only for this specific starting flow, whereas my proposal works for any state change that might take a while, plus it only needs the one loop.
Netflix uses custom MFA that requires CLI middleware to handle auth flow. The custom CLI implementation on their side intercepts 403 responses from the REST API, handles the MFA challenge, and retries the rest call again. The MFA challenge is handled only by the
startandsshactions. The remaining actions can go directly to the REST endpoints because of the custom header command that provides MFA tokens to the http calls.Both Gateway and VS Code extension delegate the start logic to the CLI, but not Toolbox which caused issues for the customer. This PR ports some of the work from Gateway in Coder Toolbox.