RFD 0225: In-Band MFA for SSH Sessions #59141
Signed-off-by: Chris Thach <[email protected]>
rfd/0224-in-band-mfa-ssh-sessions.md
Outdated
All SSH traffic destined for target nodes will be proxied through the Proxy service, which will handle authentication,
authorization, and session management. Direct SSH connections to nodes will deprecated and removed after the [transition
period](#backward-compatibility).
What do you mean by direct SSH connections to nodes here? Does this refer to connections to direct dial nodes?
I believe so? It is the case where a user is able to dial directly to the node's SSH server if they're on the same network as the node without access to Proxy or Auth services and the SSH server is exposed on the network.
Now that I think about it after our conversation yesterday, do we even support that?
Connections to direct dial nodes still go through the Proxy and are not direct connections from tsh to the Teleport SSH agent.
> It is the case where a user is able to dial directly to the node's SSH server if they're on the same network as the node without access to Proxy or Auth services and the SSH server is exposed on the network
This is an infrequent but supported use case today, though it would likely be better solved via the Relay server. That raises the question: how will this work for users connecting via the Relay server? Will the Relay server call out to the Proxy/Decision service to mediate the MFA ceremony?
I thought about this. Either it continues to be handled mostly on the client, or it moves to the control plane and is embedded more deeply in the protocol. I'm leaning towards the latter, since it is architecturally consistent with the new Proxy<->Client model that this RFD moves towards.
Isn't the Relay a proxy anyway? Couldn't we just implement `TransportServiceV2` there also for consistency?
There are other ways we can solve this that we discussed, like following WebRTC patterns and embracing more peer-to-peer, but that is too much of a shift from where we are and where we have already started heading.
> Isn't the Relay a proxy anyway? Couldn't we just implement `TransportServiceV2` there also for consistency?
Yes, the Relay will need to implement `TransportServiceV2`; it already implements `TransportService` to facilitate the connections. However, it will likely not have a local Decision service running like the Proxy does, so it will likely have to broker the MFA ceremony with the Proxy/Auth somehow to uphold the guarantees that we want w.r.t. in-band per-session MFA.
I believe you missed one key difference: in the proposed change, per-session MFA certificates are no longer required to convey that MFA was completed for a session.
I'm in agreement about removing per-session MFA certificates, the flow I presented has the same outcome. The Proxy is in the position to accept/deny the connection.
In this aspect, the difference between our two proposals is whether:
- the Client presents the MFA response to the Proxy, and the Proxy validates it with Auth
- the Client presents the MFA response to Auth with a challenge ID attached, Auth validates it, and Auth sends a validation success to the Proxy using the challenge ID
To be clear, I still don't see any benefit to the more complex `StartAuthenticateChallenge` and `CompleteAuthenticateChallenge` flow currently. It will just make the implementation more difficult and require more backend/Auth resources to track, watch, and reply to Proxy challenge requests in a headless-like implementation.
Another thing to keep in mind is that there can be multiple Auth servers, so I don't believe you can guarantee that the Client is handling the challenge on the same Auth server that the Proxy sent the `StartAuthenticateChallenge` request to. To achieve this type of flow you would need to create a watcher for the challenge ID on the backend rather than rely on any inter-process communication. You could create this watcher in the Auth server as part of `StartAuthenticateChallenge`, or make `StartAuthenticateChallenge` a non-streaming endpoint and have the Proxy create the watcher after starting the challenge. Headless takes the latter approach.
> The client solves the challenge and sends the MFAAuthenticateResponse back to the Proxy
What if the connection is being brokered by a Relay and not a Proxy? Do we want to give possession of the `MFAAuthenticateResponse` to the Relay?
The crux of this problem and proposed solution boils down to the fact that the trust levels of a Proxy and a Relay are different. The Relay intentionally has as few permissions as required to proxy end user connections to their destinations.
@rosstimothy There's some similar conversation happening in a few threads on this PR about the security model of Relay vs. Proxy. It sounds like an important thing for us to hash out.
Should we pull that discussion into a single place?
Also, do you have some detail / docs on the intentions with Relay so that we can all be on the same page?
> What if the connection is being brokered by a Relay and not a Proxy? Do we want to give possession of the `MFAAuthenticateResponse` to the Relay?
Good question. It's true that the `StartAuthenticateChallenge` flow is more controlled and prevents an attacker on the Relay service from stealing an MFA response, but the fact that an MFA response can be stolen by an attacker in different scenarios is built into our threat model. Traditionally we don't treat an MFA response as a full-blown secret. This is the reason we have scoped MFA challenges, so that even if an attacker manages to steal an MFA response within its 5-minute expiration window (including with reuse enabled), it cannot be used to perform unrelated MFA actions. Note that even to use the user session MFA response, the attacker would also need an active key/cert for the user.
I do however see the potential of `StartAuthenticateChallenge`. If we want to move away from scoped MFA challenges to "action" MFA challenges, tying an MFA challenge to the exact action being carried out, this would be one direction to go.
Another direction, which avoids the complexity of the watcher/etc, would be an `ACTION` MFA scope and a new `ChallengeExtensions.ActionID` UUID field:
- Relay sends an `EvaluateSSHAccess` (stream?) request to the Decision Service.
- Decision Service (`EvaluateSSHAccess`) generates a random action UUID and sends it back to the Relay to indicate that MFA is required.
- Relay sends the MFA action UUID to the client.
- The client performs the MFA ceremony with the `ACTION` scope and `ActionID` field set. The action ID is stored in the challenge on the backend.
- The client sends the MFA response to the Relay, which sends it back to the Decision Service through the `EvaluateSSHAccess` stream.
- The Decision Service validates the MFA response and checks that the scope and action ID are correct using `rpc ValidateAuthenticateChallenge`.
- The Decision Service returns an access permit with MFA verified to the Relay.
In this flow, we ensure the following:
- The Decision Service ensures that the MFA response provided by the client/relay was intended for that specific call to `EvaluateSSHAccess` via the UUID and scope matching.
  - Since the Relay is the Policy Enforcement Point, this technically is not necessary. The Relay could produce the random UUID itself and validate it itself. For example, instead of being a streaming RPC, `EvaluateSSHAccess` could return an MFA required error. The Relay would then generate the UUID and kick off the MFA flow. It would then provide the MFA response and action UUID to the Decision Service to validate that it matches.
- The MFA response provided to the Relay cannot be used for anything other than completing that exact `EvaluateSSHAccess` request. A stolen ACTION scoped MFA challenge is useless even if the thief has full user credentials.
- The Relay Service does not require any additional permissions. It just needs `EvaluateSSHAccess`.
It also limits the changes needed to just adding a new scope and challenge extension field. It should be much easier to implement.
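As a rough illustration of the scope and action ID check in that flow, here is a minimal sketch. All names (`Validator`, `StoredChallenge`, `NewActionID`, the scope constants) are made up for this comment and are not Teleport APIs. The actual WebAuthn/TOTP verification is omitted, and marking the challenge as single-use is an assumption of the sketch rather than something decided here.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// Scope is a hypothetical MFA challenge scope.
type Scope string

const (
	ScopeUserSession Scope = "USER_SESSION"
	ScopeAction      Scope = "ACTION"
)

// StoredChallenge is what a hypothetical backend keeps per challenge.
type StoredChallenge struct {
	Scope    Scope
	ActionID string
	Used     bool
}

// ErrMFADenied is returned for any rejected MFA response.
var ErrMFADenied = errors.New("mfa response rejected")

// Validator stands in for the service that validates challenge responses.
type Validator struct {
	challenges map[string]*StoredChallenge
}

func NewValidator() *Validator {
	return &Validator{challenges: make(map[string]*StoredChallenge)}
}

// NewActionID returns a random 128-bit identifier, hex encoded.
func NewActionID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// CreateChallenge records the scope and action ID the client asked for.
func (v *Validator) CreateChallenge(challengeID string, scope Scope, actionID string) {
	v.challenges[challengeID] = &StoredChallenge{Scope: scope, ActionID: actionID}
}

// Validate accepts a response only if the stored challenge carries the
// expected scope and action ID and has not been consumed already.
func (v *Validator) Validate(challengeID string, wantScope Scope, wantActionID string) error {
	c, ok := v.challenges[challengeID]
	if !ok || c.Used {
		return ErrMFADenied
	}
	if c.Scope != wantScope || c.ActionID != wantActionID {
		return ErrMFADenied
	}
	c.Used = true
	return nil
}

func main() {
	v := NewValidator()
	actionID, err := NewActionID()
	if err != nil {
		panic(err)
	}
	v.CreateChallenge("challenge-1", ScopeAction, actionID)

	// A response presented for a different action is rejected.
	fmt.Println(v.Validate("challenge-1", ScopeAction, "some-other-action"))
	// The response for the intended action is accepted.
	fmt.Println(v.Validate("challenge-1", ScopeAction, actionID))
}
```

The point of the sketch is that a stolen response is only useful to whoever also knows the action ID and is making that exact request, which matches the second guarantee listed above.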
tldr; having the Relay service possess a user's scoped MFA challenge response to execute an `EvaluateSSHAccess` request would not be a major departure from our current MFA security model. Even if we think Relay possession of the MFA challenge response must be avoided, I think there are simpler ways to do this, and this would likely be better tackled in a separate RFD / follow up to scoped MFA challenges (I volunteer as tribute).
We can continue discussion on the MFA flow in this new thread.
edits: clarifications now that I understand the Relay is the PDP.
…rt mention. Signed-off-by: Chris Thach <[email protected]>
…andling and connection flow Signed-off-by: Chris Thach <[email protected]>
rfd/0224-in-band-mfa-ssh-sessions.md
Outdated
### Overview

All SSH traffic destined for target nodes will be proxied through the Proxy service, which will handle authentication,
authorization, and session management. Direct SSH connections to nodes will deprecated and removed after the [transition
I may be missing something on my first read through of the RFD, but it's not entirely clear to me why direct SSH connections to nodes must be deprecated as part of this work. It would be nice to understand this better, as this is a strategy that we have recommended to customers fairly recently for non-human connections. I'm guessing this is because we want to shift the Node to only accepting/understanding Permits?
Great question! I need to reword this section, as I have learned a lot since I wrote this part of the RFD. More background is definitely needed. TODO for me. The use of Permits is nice, but it is not the main reason.
The main driver for this change is the security gaps in how Per-session MFA is implemented. Part of this was hinted at in the `Why` section, but to cut to the point, our Per-session MFA implementation was completely ineffective against a remote authentication bypass attack against a node in CVE-2025-49825. In this particular CVE, MFA checks were completely bypassed (and therefore useless) by leveraging a vulnerability in node trust. Technically it also exposed a vulnerability in our MFA implementation, so...
As a direct action item from CVE-2025-49825, the goal now is to close the security gaps that should have prevented the remote authentication bypass in the first place. The whole point of MFA is that we require a second factor of auth. An attacker should not be able to authenticate with just a single credential (e.g., client cert) if Per-session MFA is enabled.
With the way we have implemented Per-session MFA with SSH, the MFA assertion is embedded into the client certificate via an extension, i.e., two factors of auth are combined into a single credential. In effect, our MFA implementation is really a Single Factor Authentication (SFA) implementation.
Additionally, it's hard to enforce authz or security policies when you have 1000s or more node agents, all possibly running different Teleport versions and possibly enforcing the policy differently. What if we need to make a policy or security update to the enforcement logic? How do we ensure consistency and correctness? Rolling out a security patch for vulnerable node agents was another pain point of CVE-2025-49825.
This is why this RFD proposes we move from a distributed model to a centralized one, so that the critical enforcement points are easier to secure, thereby reducing the platform's attack surface. This is pretty consistent with our product philosophy here at Teleport, now that I think about it. We're also splitting the MFA assertion from the credential itself and embedding it deeper into the protocol, e.g., SSH, Desktop, etc. (they should not be combined!!!). We would now force all connections through the Proxy (or a delegate TBD), so we can make sure this policy is consistently and correctly enforced for all access.
Unfortunately a potential casualty of this architectural move is we lose direct SSH access to the node. I say potentially because I'm not absolutely certain that we can't make it work while preserving the spirit of what we're trying to achieve. If you or anyone have any ideas, I'm all ears. From what I can tell, the SSH protocol doesn't allow us to natively perform MFA enforcement that easily integrates with the Teleport platform.
That being said, it's not a total loss AFAIK. For non-humans, they can still use `tsh` or `tbot`. If a Teleport user wants high throughput or traffic to flow locally instead of routed through the Proxy, I believe they will still get that with the upcoming Relay service.
Hopefully this helps! If not, happy to answer questions and/or brainstorm with you.
P.S. Did you happen to see this thread? There is some more info there.
Credits to @rosstimothy for explaining all of this to me multiple times before I finally got it into my 🧠
Thanks for the detailed response.
> That being said, it's not a total loss AFAIK. For non-humans, they can still use tsh or tbot
Yep - in this case I am talking about cases where the customers are using `tbot`.
> I believe they will still get that with the upcoming Relay service.
I appreciate that the Relay service will allow for similar performance, but we should be cognizant that this will be a pretty big breaking change, and it will be work for customers to switch away from something that works for them today.
> Unfortunately a potential casualty of this architectural move is we lose direct SSH access to the node. I say potentially because I'm not absolutely certain that we can't make it work while preserving the spirit of what we're trying to achieve. If you or anyone have any ideas, I'm all ears. From what I can tell, the SSH protocol doesn't allow us to natively perform MFA enforcement that easily integrates with the Teleport platform.
For non-human use cases, we can't perform MFA ceremonies anyway. Would it be possible to retain the direct connection ability, but add the limitation that it will not function if Per-Session MFA is required?
No problem!
Just to be completely clear, we're both talking about this use case, right?
> It is the case where a user is able to dial directly to the node's SSH server if they're on the same network as the node without access to Proxy or Auth services and the SSH server is exposed on the network
If yes, I don't think it will be possible to retain this ability, as the node would still be acting as an MFA enforcement point since it needs logic to decide whether Per-Session MFA is required by policy before it goes to enforce that policy (e.g., invoke Auth or Decision service -> decide if MFA enforcement must be enforced for this specific request).
The idea is we remove this decision making from the node altogether and centralize enforcement at a single point, the Proxy (or a delegate). The vision is that the node would not need to handle any of this, nor would the clients, like they currently do.
I'm not writing it off as impossible though, as I think this is a valuable use case. I'm happy to put this in as a future consideration into the RFD and we can come back with a v2 that improves on this design, thoughts?
> Just to be completely clear, we're both talking about this use case, right?
Yeah - that's correct.
> The idea is we remove this decision making from the node altogether and centralize enforcement at a single point, the Proxy (or a delegate). The vision is that the node would not need to handle any of this, nor would the clients, like they currently do.
Ok cool - I think I understand the reasoning/background now.
`tsh ssh` only supports connections via the Proxy today - it does not do any direct dialing. For that use case one would need to connect via OpenSSH.
> `tsh ssh` only supports open connections via the Proxy today - it does not do any direct dialing. For that use case one would need to connect via OpenSSH.
This makes a lot of sense. Thanks. I recall you mentioning this to me, but it didn't click until now when you mapped specific commands to the different implementations
> If yes, I don't think it will be possible to retain this ability, as the node would still be acting as an MFA enforcement point since it needs logic to decide whether Per-Session MFA is required by policy before it goes to enforce that policy (e.g., invoke Auth or Decision service -> decide if MFA enforcement must be enforced for this specific request).
I don't see why the node would have to be acting as a PDP for this; asking the remote PDP service about the incoming connection could happily come back with "MFA needed".
> the SSH protocol doesn't allow us to natively perform MFA enforcement that easily integrates with the Teleport platform
SSH is literally the only protocol where it's expected that clients spawn a binary of our choosing and speak with it rather than open a network connection, so if we can't do MFA enforcement in SSH we have little hope for it elsewhere, fwiw.
> I don't see why the node would have to be acting as a PDP for this; asking the remote PDP service about the incoming connection could happily come back with "MFA needed".
Yes, this is the proposed plan so far. See refs to `EvaluateSSHAccess` for more context.
> SSH is literally the only protocol where it's expected that clients spawn a binary of our choosing and speak with it rather than open a network connection, so if we can't do MFA enforcement in SSH we have little hope for it elsewhere, fwiw.
The current proposal is that we do MFA enforcement at the control plane level. Are you saying that we should go a level deeper into the user's SSH network connection to do the enforcement?
… package Signed-off-by: Chris Thach <[email protected]>
…fallback to v1 Signed-off-by: Chris Thach <[email protected]>
…gation Signed-off-by: Chris Thach <[email protected]>
rfd/0225-in-band-mfa-ssh-sessions.md
Outdated
Proxy->>Client: Send ClusterDetails

Client->>Proxy: Establish SSH connection
I'm new to this part of the codebase, so I'll ask a dumb question.
One of the primary goals of this RFD is to update our threat model such that being able to forge certificates should not allow you to establish an SSH connection if that resource is configured for per-session MFA.
In this step, what context from the prior steps is used to tie session establishment explicitly to the MFA success, and how is forgery of that prevented?
If you could point me to the right place in code (transport service?) that would be super helpful in wrapping my head around this flow.
I think I've got it. This is all happening over a single stream, so the proxy would be validating the state of the stream is appropriate.
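To check my own understanding of "validating the state of the stream", here is a minimal sketch of a per-stream state machine that refuses SSH frames until MFA has passed. The frame kinds, `Session`, and the `verifyMFA` callback are all hypothetical names for illustration; this is not the actual `TransportService` code or the RFD's message set.

```go
package main

import (
	"errors"
	"fmt"
)

// FrameKind is a hypothetical tag for messages sent on the stream.
type FrameKind int

const (
	FrameDial FrameKind = iota // client names the target host
	FrameMFA                   // client sends its MFA response
	FrameSSH                   // client sends SSH payload
)

// Frame is a hypothetical stream message.
type Frame struct {
	Kind    FrameKind
	Payload []byte
}

type state int

const (
	awaitingDial state = iota
	awaitingMFA
	ready
)

var (
	ErrUnexpectedFrame = errors.New("frame not allowed in the current stream state")
	ErrMFAFailed       = errors.New("mfa verification failed")
)

// Session tracks one client stream. verifyMFA stands in for validating the
// MFA response with Auth.
type Session struct {
	state       state
	mfaRequired bool
	verifyMFA   func([]byte) bool
}

func NewSession(mfaRequired bool, verifyMFA func([]byte) bool) *Session {
	return &Session{mfaRequired: mfaRequired, verifyMFA: verifyMFA}
}

// Handle advances the state machine. Any returned error means the stream
// should be terminated.
func (s *Session) Handle(f Frame) error {
	switch {
	case f.Kind == FrameDial && s.state == awaitingDial:
		if s.mfaRequired {
			s.state = awaitingMFA
		} else {
			s.state = ready
		}
		return nil
	case f.Kind == FrameMFA && s.state == awaitingMFA:
		if !s.verifyMFA(f.Payload) {
			return ErrMFAFailed
		}
		s.state = ready
		return nil
	case f.Kind == FrameSSH && s.state == ready:
		return nil // forward the payload to the node
	default:
		return ErrUnexpectedFrame
	}
}

func main() {
	s := NewSession(true, func(p []byte) bool { return string(p) == "valid-response" })
	fmt.Println(s.Handle(Frame{Kind: FrameDial}))
	fmt.Println(s.Handle(Frame{Kind: FrameSSH})) // rejected: MFA not done yet
	fmt.Println(s.Handle(Frame{Kind: FrameMFA, Payload: []byte("valid-response")}))
	fmt.Println(s.Handle(Frame{Kind: FrameSSH})) // allowed after MFA
}
```

Because the MFA result lives in the stream's server-side state rather than in anything the client presents, a forged certificate alone would not move the stream to the ready state.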
> I'm new to this part of the codebase, so I'll ask a dumb question.
No such thing as dumb questions. I'm new to the codebase too. It took me quite some time to wrap my head around what is done where.
> One of the primary goals of this RFD is to update our threat model such that being able to forge certificates should not allow you to establish an SSH connection if that resource is configured for per-session MFA.
Correct!
> In this step, what context from the prior steps is used to tie session establishment explicitly to the MFA success, and how is forgery of that prevented?
> If you could point me to the right place in code (transport service?) that would be super helpful in wrapping my head around this flow.
Once you understand the high-level flow and you see how "MFA success" is encoded into the certificate, we can move over to enforcement.
Eventually the SSH server's HandleConnection method (the primary entrypoint for all SSH conns) on the node will parse the client certificate that was presented and evaluate access based on it. Checking whether the MFA requirement was satisfied is one of many checks based on authz policy.
If you keep diving down the 🐇🕳️, you'll eventually end up where the magic happens.
> how is forgery of that prevented?
Circling back to this. The RFD proposes the removal of these per-session MFA SSH certificates. Instead of using temporary certificates to convey MFA satisfaction, we convey this information via the control plane (e.g., Transport service, Proxy Router, etc).
Like I mentioned earlier in my motivations comment, doing this allows us to separate two factors of auth that were combined into one credential, mitigating against an attacker who is able to forge these certificates and bypass MFA. In other words, an attacker having access to just the SSH certificate (one factor) won't be enough when MFA is required; they'll need to compromise the control plane to provide the second factor.
rfd/0225-in-band-mfa-ssh-sessions.md
Outdated
`CompleteAuthenticateChallenge` RPC with the challenge ID and complete the challenge. Once the client completes the MFA
challenge, the `TransportService` will receive the pass/fail result and `ProxySSH` will unblock and proceed accordingly.

If the MFA verification fails, the stream is immediately terminated. Similarly, any connectivity issues with the Proxy
What kind of error message does a user get in this case?
Thanks! Added clarification here: 270307b.
I'm thinking we can provide more context in addition to the `AccessDenied` or `InternalServer` errors. Not sure if we need to go into that detail in the RFD. Happy to go to that level though if you think it's needed.
rfd/0225-in-band-mfa-ssh-sessions.md
Outdated
- `StartAuthenticateChallenge`: Only the Proxy service is permitted to invoke this RPC, allowing it to initiate MFA
challenges on behalf of users. Direct user access to this RPC will be denied.
How does this work for connections that are routed through a Relay instead of a Proxy?
I expect connections routed via the Relay should behave the same as via the Proxy. From what I can tell, they share the same v1 `TransportService` implementation.
We'll need to implement the v2 `TransportService` and then have Proxy and Relay import that new v2 implementation. The Relay will now also need to dial Auth in order to invoke the new `MFAService` to initiate MFA challenges.
I updated the RFD in 1e4568e to try to clarify that the Relay will need changes that mirror the Proxy's.
rfd/0225-in-band-mfa-ssh-sessions.md
Outdated
// SSH payload
Frame ssh = 2;
Is forwarding the SSH agent frames no longer required?
No longer required according to this comment by @espadolini
rfd/0225-in-band-mfa-ssh-sessions.md
Outdated
// Final response indicating the result of the MFA challenge.
bool success = 2;
Is there any other data that we might want to provide when a challenge is completed successfully?
I tried to keep this message as minimal as possible, to avoid sharing any potentially sensitive information beyond what is needed, since the callers (Proxy/Relay) are expected to have a reduced privilege set.
That being said, your comment did make me realize that there was no way of conveying errors that happen between the User <-> Auth service. I added a message for the result of that interaction in b0181b2.
rfd/0225-in-band-mfa-ssh-sessions.md
Outdated
Per-session MFA SSH certificates are not required in the new design except for backwards compatibility with legacy
clients. They were previously used to convey session metadata and enforce MFA at the Teleport Agent. With the new
architecture, the Proxy and Auth service handle these responsibilities directly. Support for per-session MFA SSH
certificates via `ProxySSH` RPC will initially be retained during the transition period to ensure backward compatibility
with existing clients (see [Backward Compatibility](#backward-compatibility)).
> Per-session MFA SSH certificates are not required in the new design except for backwards compatibility with legacy clients.
Is compatibility only something that concerns clients? What happens if an older SSH agent is still around that only knows about MFA SSH certificates?
Great catch. I added some handling for legacy agents in 7583f96
rfd/0225-in-band-mfa-ssh-sessions.md
Outdated
end
end

Proxy->>Node: Dial target host
How does the node know whether MFA was required for the session?
My understanding is that we have the auth server complete the MFA challenge so that we don't have to elevate the level of trust given to the proxy and relay servers.
If the decision service lives in proxy / relay though, they can just decide not to do MFA, unless the node itself has some concept of "Wait, MFA should be required for this session, but I don't see a stapled permit."
I forgot to update this diagram after some previous changes. I did so in this commit.
> How does the node know whether MFA was required for the session?
It should no longer be concerned with this since it happens before it receives the incoming dial from Proxy / Relay.
> My understanding is that we have the auth server complete the MFA challenge so that we don't have to elevate the level of trust given to the proxy and relay servers.
Correct
> If the decision service lives in proxy / relay though, they can just decide not to do MFA, unless the node itself has some concept of "Wait, MFA should be required for this session, but I don't see a stapled permit."
Going back to trust, in this new model, we're moving away from node-level MFA session checks and having the node trust that the Decision / Proxy / Relay services did their jobs correctly.
If the Decision / Proxy / Relay services decide not to enforce MFA for a session when they are supposed to, that is an issue, and it is a known risk as defined in the Access Control Decision API.
For proxy that makes sense, but this would be a new issue with relay right? Based on the comments elsewhere it seems like the relay isn't meant to have that level of authority, and is just supposed to be more like a network router.
Relay shares the exact same TransportService as Proxy does. As proposed, it will eventually share the same v2 version of TransportService too.
I don't think it's true that Relay is just a pure network router, because the Transport v1 that it currently implements does more than that. For example, it has access to the user's auth context and eventually does authz checks on it. Maybe that's a desired future state?
Are we all OK with just giving the same level of access to Relay as Proxy has to initiate MFA challenges with Auth? @espadolini @tigrato @rosstimothy would love your input here too.
> Are we all OK with just giving the same level of access to Relay as Proxy has to initiate MFA challenges with Auth?
For what it's worth, I don't have a strong opinion on whether Relay should have the same security model as Proxy. I just want to make sure we're all on the same page as to whether it does, and more importantly that our customers understand the security model too.
> Going back to trust: in this new model, we're moving away from node-level MFA session checks and having the node trust that the Decision / Proxy / Relay services did their jobs correctly.
> If the Decision / Proxy / Relay services decide not to enforce MFA for a session when they are supposed to, that is an issue, and it is a known risk as defined in the Access Control Decision API.
From gathering context and speaking briefly with @espadolini, it does not sound like the Relay service is intended to be a Policy Enforcement Point, and even less so the Policy Decision Point. The Relay service is only meant to forward connections to the target node in the same capacity that the Proxy does today. The contradiction between the Relay service taking the responsibility of forwarding connections and the Proxy taking responsibility as the PEP, and enforcing that through connection forwarding decisions, has not been addressed yet as far as I can tell.
This whole concern seems to be beyond the scope of this RFD, so we shouldn't make assumptions about the Relay becoming a Policy Enforcement or Decision Point. In the current state of the Decision and Relay services, it does not seem like the Transport service is the right place to enforce in-band MFA.
We need to instead find a way to enforce MFA on the Node as it is today, just without the MFA certs. For example, we could do this by extending the SSH protocol to handle in-band MFA authorization, with the node being the one to start / validate MFA challenges.
Moved to #59141 (comment)
Signed-off-by: Chris Thach <[email protected]>
Signed-off-by: Chris Thach <[email protected]>
Signed-off-by: Chris Thach <[email protected]>
Product: approved since this RFD has UX/Product changes.
Signed-off-by: Chris Thach <[email protected]>
Signed-off-by: Chris Thach <[email protected]>
A new MFA service will be introduced to handle MFA challenges and responses instead of continuing to introduce new RPCs
to the legacy AuthService. Existing MFA related RPCs in the Auth service can eventually be migrated to this new MFA
service in a future effort.

Re: MFA flow discussion. Context:
Thanks, @Joerger and @espadolini, for the context on the Relay service. I agree that, given the purpose of the Relay service, it doesn't make sense to make it a PEP or a PDP.
> We need to instead find a way to enforce MFA on the Node as it is today, just without the MFA certs. For example, we could do this by extending the SSH protocol to handle in-band MFA authorization, with the node being the one to start / validate MFA challenges.
@rosstimothy and I initially wrote this off before I started the RFD because of possible issues with extending the SSH protocol (e.g., client compatibility) and problems rolling out authz updates to the agents (one of this RFD's goals).
I'm going to do a deep dive to see if this is a feasible path. I know we implement a custom SSH conn handler. I need to see how much it can be extended and what issues may come from it.
I did a quick PoC of extending the `golang.org/x/crypto/ssh` server to support MFA in-band here. It proves that we can chain multiple authentication methods and require them all to succeed in order to grant a session to a user. I did the PoC with `golang.org/x/crypto/ssh` since it's the library we use for our Teleport SSH server.
Now that we know it's possible, I'm going to extend our own implementation to make it work with `tsh` (another PoC). Will post an update next week.
For that to work we definitely need an MFA challenge-response mechanism that is safe for the client to do when speaking to the agent rather than the control plane; the obvious one - which is unfortunately really tied to the SSH protocol itself, but maybe we can make the interaction with our MFA system somewhat generic - is that the challenge and response should be tied to the session identifier (which is available in `x/crypto/ssh` in server auth callbacks but not in client ones except VERY indirectly and hackily) so that both parties know that the MFA challenge and response is tied to a specific SSH connection and any attack involving MITM or reuse will just not work.
@cthach This is super interesting. I think this is a good case for Doyensec to review once you have an implementation ready as well.
> For that to work we definitely need an MFA challenge-response mechanism that is safe for the client to do when speaking to the agent rather than the control plane; the obvious one - which is unfortunately really tied to the SSH protocol itself, but maybe we can make the interaction with our MFA system somewhat generic - is that the challenge and response should be tied to the session identifier (which is available in `x/crypto/ssh` in server auth callbacks but not in client ones except VERY indirectly and hackily) so that both parties know that the MFA challenge and response is tied to a specific SSH connection and any attack involving MITM or reuse will just not work.
Just want to acknowledge this great point and mention I'm merging this with @Joerger's proposed path and some learnings from the PoC. I should have the next iteration ready for review in the next few days (aiming for ASAP lol).
Thanks all for your continued patience and feedback!
Signed-off-by: Chris Thach <[email protected]>
…h implementation, and handle backwards compatibility Signed-off-by: Chris Thach <[email protected]>

#### Decision Service

The Decision service will be updated to support evaluating SSH access requests with MFA challenge responses. This
@fspmarshall I would love to get your thoughts on this new approach before I open this for review.
What are your thoughts on the Decision service accepting the MFA challenge response before it issues a permit? I noticed that this is a common pattern for our APIs. The Decision service would be responsible for calling Auth to validate the response.
I also added a structured response to callers (per your suggestion) to more robustly inform them that a permit was denied because MFA wasn't done, rather than relying on just an error message string as the current implementation does.
This is a tricky question, and I've been talking it over with some others. There are pros and cons.
On the plus side, as you point out, representing the need for MFA as an error/`Denial` is the common standard across Teleport right now. Because of that, going this way (with the MFA requirement being expressed via a `Denial`) is definitely the easier path forward from a development perspective, since that's already how the ssh decision logic currently works. And, once we accept that the PDP represents MFA requirements as a `Denial`, then I see how it flows logically that one might also want the PDP to validate the MFA response as part of the `Permit` path.
However, I have some misgivings about both the structure of representing the MFA requirement as part of the denial, and about making the PDP be the response validator.
On the broader subject of how to represent the MFA requirement I have two main misgivings:
- One of our goals in the decision service is to have it eventually be a tool for policy introspection, i.e. an auditing/testing tool, not just an internal implementation detail of Teleport's access control. In that context, the MFA requirement being enforced by a `Denial` isn't as desirable. Say, for example, that I have a user named `alice` who can ssh into `node.example.com`, but only with MFA. If I ask the decision service, "can `alice` ssh into `node.example.com`?", getting a `Denial` decision and needing to re-run the query with something like "can `alice` ssh into `node.example.com` with `--mfa-verified=true`?" to see her allowed access is very cumbersome. A much better experience would be to receive a single answer that says something like "yes, `alice` can ssh into `node.example.com`, but only if she provides MFA". I.e. having the MFA requirement be expressed as part of a conditional allow decision makes a lot more sense from a UX perspective than having it be a special-case denial.
- Needing to "re-run" a decision after having already taken action based on a previous incarnation of that decision is IMO a weaker model overall. Roles/configuration may change between calls to the decision service. If we make a decision and that decision requires us to enforce certain conditions, it is a cleaner behavioral model to have the parameters of the allowed access and the conditions being enforced originate from the same "configuration state", rather than enforcing conditions derived from one configuration state, then deriving parameters of access from another. Additionally, any stateful elements associated with decisions (e.g. rate-limiters, logging, etc.) become easier to work with if decisions have a 1-to-1 relationship with access attempts.
On the more specific subject of making validation part of the PDP, I think this conflicts somewhat with the philosophical model of the PDP. One of the key design goals of the PDP is to provide robust logical isolation between "decision" logic and "enforcement" logic. This is why the PDP APIs accept a description of a user identity, rather than validating a user certificate, and why we don't try to, for example, increment the `max_connections` semaphore inside of the PDP. Separating out enforcement/authentication/validation/etc. from core decision logic makes decision logic easier to audit, easier to test, more portable, etc. I believe the same holds true for MFA challenge validation. MFA challenge validation is a responsibility that I would prefer to live outside of the PDP if possible, to help keep decision logic more cleanly isolated from enforcement logic.
In a perfect world, my preference would be that the PDP API represented MFA requirements as a condition within a `Permit` decision, and that enforcement-side logic acted upon that condition. Something like:
rsp, _ := pdp.EvaluateSSHAccess(ctx, &EvaluateSSHAccessRequest{...})
if rsp.GetDenial() != nil {
	rejectAccessAttempt()
	return
}
if rsp.GetPermit().GetRequirePerSessionMfa() {
	if err := doMFACeremony(); err != nil {
		rejectAccessAttempt()
		return
	}
}
continueWithAccess()
I'm open to the idea that some of this might be more implementation churn than we want. Making the PDP represent MFA requirements as part of the permit would require updating the internals of `services.AccessChecker` to allow for a mode where the MFA requirement gets expressed as a parameter rather than an error. However, I do feel that this would result in a better system overall.
Signed-off-by: Chris Thach <[email protected]>
…e MFA challenges Signed-off-by: Chris Thach <[email protected]>
Signed-off-by: Chris Thach <[email protected]>
However, when connecting to _multiple SSH hosts_ as part of a single user action (e.g., `tsh ssh root@env=example
uptime`), the user may need to complete the MFA challenge multiple times depending on each target host's MFA
requirements.

This is due to the fact that the current design only evaluates MFA requirements _once_ on the first host matching the
label, and if MFA was required, the per-session MFA certificate was used for all subsequent hosts without further MFA
checks. Moving to in-band MFA enforcement means that each target host will independently evaluate MFA requirements
during session establishment, which increases security.
@rosstimothy or anyone
I'm not sure we can avoid this without somehow sharing an MFA challenge between different hosts. Allowing a single MFA challenge to be shared opens us up to replay attacks.
Is this how we accomplished "single MFA challenge for many hosts" with per-session MFA certificates in the current implementation? If yes, would we like to continue making that security tradeoff in favor of a better UX?
The other option we discussed was some sort of platform-wide meta-session, but that could easily blow up the scope of this RFD, so I'm ruling it out as we agreed.
π₯ (it's a link)
Honestly I didn't realize we allowed users to connect to multiple hosts with a single MFA cert like that, nice catch. We could try to adopt that same `tsh db exec` flow, but I'm not sure exactly how that fits into the new flow. Just wanted to share the additional context before a deeper review.
Signed-off-by: Chris Thach <[email protected]>
39a9315
to
88e13d5
What
An RFD proposing to move multi-factor authentication (MFA) enforcement for SSH sessions from out-of-band to in-band, as part of SSH session establishment.
The proposal includes new gRPC endpoints and an MFA verification layer on top of SSH, with a migration and deprecation plan for backward compatibility.
See more background and motivations in this comment.
Proof of Concepts
There were multiple PoCs done for this RFD.
- The first one, where `TransportService` did MFA enforcement, can be found here.
- The second and latest iteration, where the SSH service at the Teleport Agent performs MFA enforcement within the SSH protocol, is here.