Propose Tool Authorization API #8
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: haiyanmeng The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Welcome @haiyanmeng! |
3f76fad to 9ae9f6e
docs/proposals/0008-ToolAuthAPI.md
Outdated
|
|
||
| The authentication of MCP tool access is not within the scope of this proposal, and will be explored separately in the future. | ||
|
|
||
| # Use Cases & Motivation |
Reference the personas and user journeys described in #5 after it is merged.
|
/cc @david-martin |
|
/cc @howardjohn |
| @@ -0,0 +1,459 @@ | |||
| # Tool Authorization in Agentic Networking | |||
|
|
|||
| This proposal defines authorization policies for tool access from AI agents running inside a Kubernetes cluster to MCP servers running in the Kubernetes cluster or outside of the Kubernetes cluster. By default, an AI agent can call initialize, notifications/initialized and tools/list. To enforce a "zero trust" security posture, a tools/call is denied unless it is allowed through the Tool Auth API described in this proposal. | |||
Based on the Core Capabilities description, it looks like the ingress use case is also in the scope for this SIG.
Should we include authorization for ingress in this proposal as well?
The use case I have in mind is when MCP backends are running inside the Kubernetes cluster, and external agents invoke tool calls through the ingress gateway, where authorization should be enforced based on the identity in the request. For example, the scopes in the access token recommended by the MCP authorization spec.
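The default behaviour stated in the proposal text (initialize, notifications/initialized and tools/list are always callable, while tools/call is denied unless explicitly allowed) can be sketched roughly as below. This is only an illustration: the `allowedTools` type, the `authorize` function, and the choice to deny MCP methods that the proposal does not mention are assumptions, not part of the proposed API.

```go
package main

import "fmt"

// alwaysAllowed lists the MCP methods that the proposal says an agent
// can call by default, without any policy.
var alwaysAllowed = map[string]bool{
	"initialize":                true,
	"notifications/initialized": true,
	"tools/list":                true,
}

// allowedTools is a hypothetical, already-resolved allow-list for one agent:
// backend name -> set of tool names the agent may call.
type allowedTools map[string]map[string]bool

// authorize reports whether a request for the given MCP method may proceed.
func authorize(policy allowedTools, backend, method, tool string) bool {
	if alwaysAllowed[method] {
		return true
	}
	if method != "tools/call" {
		// Assumption: methods not named in the proposal are denied.
		return false
	}
	// Indexing a missing backend or tool yields false, i.e. default deny.
	return policy[backend][tool]
}

func main() {
	policy := allowedTools{"mcp-server2": {"read_wiki_structure": true}}
	fmt.Println(authorize(policy, "mcp-server2", "tools/list", ""))
	fmt.Println(authorize(policy, "mcp-server2", "tools/call", "read_wiki_structure"))
	fmt.Println(authorize(policy, "mcp-server2", "tools/call", "sendEmail"))
}
```

An ingress variant would presumably resolve the allow-list from the caller's token scopes instead of a workload identity, which this sketch does not attempt.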
| Source Source `json:"source"` | ||
| // Tools specifies a list of tools. | ||
| // +optional | ||
| Tools []string `json:"tools,omitempty"` |
I've shared my thoughts about this proposed API in the design doc already, but now feel like this is the place to put it on record again.
I strongly believe this may be going in the wrong direction. An API that consists of listing resources (tools today, but likely resources and prompts soon) and principals, all in an AuthPolicy CR, has a few problems that IMO should not be overlooked:
- it doesn't scale well to large numbers of resources and principals;
- it copies very specific existing approaches, being heavily influenced by how one implementation in particular would probably support it underneath, rather than focusing on UX first;
- it oversteps on Kubernetes RBAC as the means of storing authorisation data.
|
|
||
| #### Protocol-Aware Authorization for MCP Tools | ||
|
|
||
| As an AI Engineer, I want to create authorization policies to specify which individual tools (e.g., getWeather, sendEmail) my agent is permitted to call on an allow-listed MCP server, so that I can enforce least-privilege access at the specific tool-function level, not just the network endpoint. |
Following on from a comment discussion in the original google doc,
how do you feel about a user journey about filtering the list of tools returned from a tool list so that tokens and time are not wasted trying to call a tool that a user doesn't have access to?
I'm +1 to that, that's a good idea. What would the implementation look like?
I'm good with having a TODO here, and it being explored in a prototype.
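As a starting point for that prototype, filtering a tools/list result down to the tools a caller is allowed to invoke could look roughly like the sketch below. The `Tool` type is a minimal stand-in for an entry in the MCP response, and `filterTools` and the `allowed` set are hypothetical names; nothing here is defined by the proposal.

```go
package main

import "fmt"

// Tool is a minimal stand-in for one entry in an MCP tools/list result.
type Tool struct {
	Name        string
	Description string
}

// filterTools returns only the tools whose names appear in allowed,
// preserving the order in which the MCP server returned them.
func filterTools(tools []Tool, allowed map[string]bool) []Tool {
	out := make([]Tool, 0, len(tools))
	for _, t := range tools {
		if allowed[t.Name] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	tools := []Tool{{Name: "getWeather"}, {Name: "sendEmail"}}
	visible := filterTools(tools, map[string]bool{"getWeather": true})
	for _, t := range visible {
		fmt.Println(t.Name)
	}
}
```

A gateway doing this would need to parse and rewrite the response body of tools/list, which is a different cost profile from only inspecting tools/call requests.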
Blocking this for a while so we have a chance to work further on the details of the API proposal. I believe we need some changes, even at this early stage, to ensure this works across multiple implementations and that we get early feedback.
| // when representing Kubernetes workload identities. | ||
| // | ||
| // +optional | ||
| Identities []string `json:"identities,omitempty"` |
chiming in from SPIFFE over here (but if i'm missing context, let me know - I also commented this on the other PR and will bring that information over if valid). I can imagine that Identities can be useful for cross-cluster or more granular expressivity than the ServiceAccounts below.
However, thinking more about it, for the cross-cluster case, I have a problem with how one can establish trust with the foreign trust domain and actually verify the identity properly. In addition to the Identity string, there would need to be (either here or somewhere else) some information about that foreign trust domain. We cannot simply trust that the extracted name is correct; we have to validate it against some out-of-band knowledge. Where is that information supposed to come from?
The proposal targets egress traffic. The identities should be for pods running on the same Kubernetes cluster as the one where the AuthPolicy resource is installed.
If the proposal is for egress, we should call this out at the top of the document, especially because it's either that or "zero trust".
docs/proposals/0008-ToolAuthAPI.md
Outdated
|
|
||
| #### Agent Identity | ||
|
|
||
| As an AI Engineer, I want to assign a unique, verifiable identity to my agent running in Kubernetes, so that gateways or external systems can securely authenticate it and make authorization decisions. |
This user journey doesn't seem to be directly covered by this proposal. Or is it?
It's more like suggesting that, by only allowing declaring permissions for Service Accounts and SPIFFE IDs, those should be, indirectly, assumed the ways to represent identity.
I see "assigning a unique, verifiable identity to an agent" as incompatible with authentication being a non-goal.
| Type *BackendType `json:"type"` | ||
| // MCP defines a MCP backend. | ||
| // +optional | ||
| MCP MCPBackend `json:"mcp,omitempty"` |
This Backend API is representative of egress gateways, a use case that transcends agentic, and AI in general. Having MCPBackend makes this more specifically an agentic AI API.
I expect this capability belongs in Gateway API itself. If we move forward with this API here, how do we see threading the needle with all the other use cases and stakeholders for this?
This is discussed in https://kubernetes.slack.com/archives/C09P6KS6EQZ/p1762448311074899
FWIW I had planned to base the Backend in this proposal on what you have here in order to avoid repeating work. A GEP would make my life a bit easier in that regard.
+1 to a GEP
Echoing feedback and discussions we had during the KubeCon Gateway API break room sessions.
We agreed that Backend probably belongs in Gateway API. We also agreed that Agentic Networking will experiment with this at its own pace and will move it to a GEP once we have something we feel is implementable.
As Shane said as well, it will likely require collaboration with other WGs and Gateway API to make sure it's the "right" backend that fits more generic needs, but we should not hold up iteration and prototyping at this point in the project.
In agentic networking we have three primary backends: LLMs (cloud or on-prem hosted), MCP, and Agent, as an agent can call LLMs, tools, or other agents. In the Envoy ecosystem, we have an alpha version of Backend implemented in the Envoy Gateway project, and in the Envoy AI Gateway project we introduced AIServiceBackend for LLM egress cases. I think there are two options:
1. Add a generic `Backend` in Gateway API, and here we define `AgenticBackend` to have a reference to `Backend`.
2. A composite backend like the one defined here, but allowing MCP-, LLM-, and Agent-specific backend fields to be defined.
I think prototyping and iterating on a single CRD with a type field would be quicker than introducing additional CRDs at this stage.
So, 2 is my recommendation.
9ae9f6e to 184945d
32a24db to 3b898c3
docs/proposals/0008-ToolAuthAPI.md
Outdated
|
|
||
| const ( | ||
| // ActionAllow allows requests that match the policy rules. | ||
| ActionAllow AuthPolicyAction = "ALLOW" |
Deny lists can be convenient in some use cases but are also a bad practice security-wise. IMO, we should not encourage it.
My recommendation is to drop this field altogether and support only allow lists at the beginning.
I agree with the sentiment, but I think this structure is more extensible for if we do want to add a deny list mode in the future (compared to having to introduce a new field and deal with conflicts)
AFAIAC, adding a new field generates no conflict, as long as it's an optional field and it defaults to ALLOW. However, I'm more interested in why we would want to add deny list mode in the future. That would be a hard decision to say the least, picking convenience over security. Or perhaps you're envisioning some other feature @keithmattix?
Fair point on the new field; not as difficult as I originally thought.
Re: denylist - I think it's misleading to think of all security best practices in absolutes, especially in API design. A deny list, while not ideal, is often the best certain organizations can do at a given moment, and it's better for them to have some traffic paths blocked vs. leaving everything open until every known good path can be ascertained. Organizations won't risk an outage to adopt a default deny posture before they're ready
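To make the "optional field that defaults to ALLOW" point from this thread concrete, here is a minimal sketch. `AuthPolicyAction` and `ActionAllow` come from the diff under review; `ActionDeny`, its `"DENY"` value, and the `effectiveAction` helper are assumptions added only for illustration.

```go
package main

import "fmt"

// AuthPolicyAction mirrors the type in the diff under review.
type AuthPolicyAction string

const (
	// ActionAllow allows requests that match the policy rules.
	ActionAllow AuthPolicyAction = "ALLOW"
	// ActionDeny is hypothetical: a value that could be added later
	// if a deny-list mode were ever introduced.
	ActionDeny AuthPolicyAction = "DENY"
)

// effectiveAction treats an unset or empty action as ALLOW, so policies
// written before the field existed keep their allow-list meaning.
func effectiveAction(a *AuthPolicyAction) AuthPolicyAction {
	if a == nil || *a == "" {
		return ActionAllow
	}
	return *a
}

func main() {
	deny := ActionDeny
	fmt.Println(effectiveAction(nil), effectiveAction(&deny))
}
```

Either shape (an enum present from the start, or an optional field added later) ends up with the same defaulting logic; the open question in the thread is whether a deny mode should exist at all.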
| - apiGroups: ["agentic.networking.x-k8s.io"] | ||
| resources: ["backends"] | ||
| resourceNames: ["mcp-server2"] | ||
| verbs: ["read_wiki_structure"] |
Even though this would work, the resource doesn't have to be a kind known by the API server. In this particular example, tools/call is implicit. Another option could be:
- apiGroups: ["agentic.networking.x-k8s.io"]
  resources: ["mcptools"]
  resourceNames: ["mcp-server2/read_wiki_structure"]
  verbs: ["call"]
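For illustration, a gateway following this alternative would need to translate each tools/call into the attributes that such a rule matches on. The sketch below shows only that mapping; `AccessAttributes` is a local illustrative type rather than a Kubernetes API type, `attributesForToolCall` is a hypothetical helper, and the actual access check against the API server is left out.

```go
package main

import "fmt"

// AccessAttributes mirrors the fields an RBAC rule matches on.
// It is a local type for this sketch, not a Kubernetes API type.
type AccessAttributes struct {
	APIGroup     string
	Resource     string
	ResourceName string
	Verb         string
}

// attributesForToolCall maps a tools/call on (backend, tool) to the virtual
// "mcptools" resource suggested in the comment above.
func attributesForToolCall(backend, tool string) AccessAttributes {
	return AccessAttributes{
		APIGroup:     "agentic.networking.x-k8s.io",
		Resource:     "mcptools",
		ResourceName: backend + "/" + tool,
		Verb:         "call",
	}
}

func main() {
	fmt.Printf("%+v\n", attributesForToolCall("mcp-server2", "read_wiki_structure"))
}
```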
|
||
| #### **Latency (The `ext_authz` Hop)** | ||
|
|
||
| Because it relies on Envoy's *external authorization* API, every request that hits your Gateway must pause, make a network hop to the Authorino service, wait for a decision, and then resume. |
Sorry. I know at a glance this is not obvious, but rather hidden as an implementation detail of Kuadrant. Although Authorino indeed implements Envoy ext_authz and could be integrated directly from an HTTPRoute, when configured via a Kuadrant AuthPolicy, it does not leverage the Envoy ext_authz filter. Instead, Kuadrant makes the call (or not) to Authorino from a wasm module that runs in the same process as the proxy itself. Therefore, the extra hop is not always needed, thus the latency issue due to an extra hop is true, just not at "every request".
This proposal aims to address the user journeys described in kubernetes-sigs#5
except for the AuthScheme CRD. The AuthScheme CRD can be added in a follow-up PR.
3b898c3 to eb58ad8
We will address these two areas in a follow-up PR
eb58ad8 to 472db92
|
|
||
| The CRD names may change depending on the OSS feedback. | ||
|
|
||
| > **_NOTE:_** The API does not cover identity extraction, or request authentication. We will cover them in a follow-up Pull Request. |
@guicassolato , I incorporated all your suggestions in haiyanmeng#1 except for the AuthScheme CRD into this PR, and added this note.
After this PR is merged, you can create a PR against https://github.com/kubernetes-sigs/kube-agentic-networking for the AuthScheme CRD. What do you think?
I'd prefer to keep both APIs in a unified proposal as we originally discussed. Here's my reasoning: The two APIs address different use cases and architectural patterns. I believe presenting them together allows implementations to evaluate which approach (or both) fits their needs, rather than prescribing one path forward.
Specific concerns with AccessPolicy as a standalone solution: For dynamic environments (like MCP with Dynamic Client Registration and OAuth2), the AccessPolicy pattern requires constantly reconciling a potentially large policy resource as principals change. This creates scalability challenges with:
- Large policy objects being frequently updated
- Potential conflicts during concurrent reconciliation
- Storage/etcd pressure for environments with many dynamic principals
Why AuthScheme complements this: AuthScheme addresses these concerns by delegating to external trust sources (OIDC providers, Kubernetes RBAC) rather than storing all permission data in-cluster. The CEL-based extraction and verification model supports:
- Service Account token validation for Kube-native apps
- OIDC federation for dynamic client scenarios
- Kubernetes RBAC for environments preferring existing authorization mechanisms
- In-policy pattern-matching authorisation rules for things like JWT claim checks
I see value in both approaches for different contexts, which is why I'd advocate for landing them together. That said, I recognize this is a community decision. Is there a specific concern about including AuthScheme in this PR that we could address?
@guicassolato , thanks for sharing your thoughts. I do see the value of having an AuthScheme resource. My main reasons for introducing AuthScheme in a follow-up PR are:
- At this point the community seems okay with `Backend` and `AccessPolicy` (formerly named `AuthPolicy`). However, `AuthScheme` is new, and it may take some time for the community to reach an agreement on its design details.
- Putting `AuthScheme` in a separate PR allows higher velocity. If there is feedback from the community to address, you can address it directly without going through https://github.com/haiyanmeng/kube-agentic-networking.
> This creates scalability challenges with:

I don't think we should be bothered with scalability at this stage. We are in a prototyping phase. I would like to understand the potential-conflicts argument better, though.
AuthScheme, while robust, is a very opinionated way of doing something, and it's complex. I do want to have more discussion on that in a wider forum.
I think it's a win that we have some consensus on "Backend" and "AccessPolicy" (and thanks for the awesome changes you suggested), but my real concern is that we start with something very complex to begin with. I'd rather tackle it one by one.
Backend is already a new type folks need to adopt, and AccessPolicy is another one.
Anyway, I guess the summary of what I'm saying is: I do understand AuthScheme's place, but I would love to have it as a fast follow to discuss and see exactly which user flows we are covering and how. We spent a good amount of time discussing Backend and AccessPolicy (formerly ToolAuthPolicy) to gather feedback, but we haven't done this for AuthScheme. And I do think it's important to allow a UX where users don't need an AuthScheme, but can use it (or some other mechanism, TBD based on discussions and feedback) to customize or enhance what they need.
👋 I've been lurking for a while with some ideas, but I just carved out some time to read the entire proposal and comments, so apologies for popping in randomly!
As a quick middle-ground proposal for the authN question: instead of defining the whole extraction flow in this PR, we could just define the canonical identity formats that AccessPolicy should match, similar to how fine-grained authorization systems like Zanzibar handle identities from different sources.
e.g., these three can all refer to the same workload:
- `spiffe://cluster.local/ns/default/sa/agent`
- `{ namespace: "default", name: "agent" }`
- `{ iss: "https://issuer.example.com", sub: "agent@default" }`
which can be normalised to something like: sa:default/agent
So we could define formats such as sa:<ns>/<name>, spiffe:<uri>, oidc:<issuer>/<sub>, external:<id>, and leave the details of how those identities get extracted to a follow-up proposal. Without this, two implementations could evaluate AccessPolicy against different identity strings for the same principal, causing potential consistency violations and policy drift.
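A rough sketch of what such normalisation might look like for the SPIFFE and ServiceAccount cases is shown below. The function names, the `localTrustDomain` parameter, and the decision to leave SPIFFE IDs from other trust domains unchanged are all assumptions made for illustration; mapping OIDC claims onto a workload would need out-of-band knowledge and is not attempted here.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalFromServiceAccount builds the "sa:<ns>/<name>" form.
func canonicalFromServiceAccount(namespace, name string) string {
	return "sa:" + namespace + "/" + name
}

// canonicalFromSPIFFE maps spiffe://<trust-domain>/ns/<ns>/sa/<name> to
// "sa:<ns>/<name>", but only when the trust domain is the local one.
// Any other SPIFFE ID is returned unchanged (an assumption of this sketch).
func canonicalFromSPIFFE(id, localTrustDomain string) (string, error) {
	const prefix = "spiffe://"
	if !strings.HasPrefix(id, prefix) {
		return "", fmt.Errorf("not a SPIFFE ID: %q", id)
	}
	// Expected parts: [trust-domain, "ns", namespace, "sa", name].
	parts := strings.Split(strings.TrimPrefix(id, prefix), "/")
	if len(parts) == 5 && parts[0] == localTrustDomain &&
		parts[1] == "ns" && parts[3] == "sa" &&
		parts[2] != "" && parts[4] != "" {
		return canonicalFromServiceAccount(parts[2], parts[4]), nil
	}
	return id, nil
}

func main() {
	a, err := canonicalFromSPIFFE("spiffe://cluster.local/ns/default/sa/agent", "cluster.local")
	if err != nil {
		panic(err)
	}
	b := canonicalFromServiceAccount("default", "agent")
	fmt.Println(a, b, a == b)
}
```

Checking the trust domain before collapsing a SPIFFE ID into the `sa:` form is one way to avoid treating a foreign workload as a local ServiceAccount, which is the verification concern raised earlier in this review.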
Please let's discuss the AuthScheme in haiyanmeng#1. I'm very interested in your points regarding it being opinionated and complex @LiorLieberman.
| ServiceName string `json:"serviceName,omitempty"` | ||
| // Hostname defines the hostname of the external MCP service to connect to. | ||
| // +optional | ||
| Hostname string `json:"hostname,omitempty"` | ||
| // Port defines the port of the backend endpoint. | ||
| // +required | ||
| Port int32 `json:"port"` | ||
| // Path is the URL path of the MCP backend for MCP traffic. | ||
| // A MCP backend may serve both MCP traffic and non-MCP traffic. | ||
| // If not specified, the default is /mcp. | ||
| // +optional | ||
| // +kubebuilder:default:=/mcp | ||
| Path string `json:"path,omitempty"` | ||
| } |
These fields are not MCP-specific; they can be used by other agentic backends such as LLMs or agents.
So move into BackendSpec?
Abstract when needed as mcp specifics arise.
Maybe something like MCP protocol version could be useful, but I'd prefer to have a concrete use case.
| } | ||
|
|
||
| // AccessPolicySpec defines the desired state of AccessPolicy. | ||
| type AccessPolicySpec struct { |
This looks like an authorization policy, which can be solved by ext authz using OPA, which is far more expressive and flexible than defining a k8s resource here?
|
|
||
| This proposal defines authorization policies for tool access from AI agents running inside a Kubernetes cluster to MCP servers running in the Kubernetes cluster or outside of the Kubernetes cluster. By default, an AI agent can call initialize, notifications/initialized and tools/list. To enforce a "zero trust" security posture, a tools/call is denied unless it is allowed through the Tool Auth API described in this proposal. | ||
|
|
||
| # ⚠️ Warning: Experimental API |
| # ⚠️ Warning: Experimental API | |
| # 🚫🚫 **STOP – EXPERIMENTAL API** 🚫🚫 | |
| **Do NOT use this in production.** |
Making it more obvious not to use this yet.