Conversation

@haiyanmeng

No description provided.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haiyanmeng
Once this PR has been reviewed and has the lgtm label, please assign liorlieberman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the `cncf-cla: yes` label (Indicates the PR's author has signed the CNCF CLA.) on Nov 4, 2025
@k8s-ci-robot
Contributor

Welcome @haiyanmeng!

It looks like this is your first PR to kubernetes-sigs/kube-agentic-networking 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kube-agentic-networking has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the `size/L` label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Nov 4, 2025

The authentication of MCP tool access is not within the scope of this proposal, and will be explored separately in the future.

# Use Cases & Motivation
Author

Reference the personas and user journeys described in #5 after it is merged.

@shaneutt
Member

shaneutt commented Nov 4, 2025

/cc @david-martin

@rramkumar1
Contributor

/cc @howardjohn

@@ -0,0 +1,459 @@
# Tool Authorization in Agentic Networking

This proposal defines authorization policies for tool access from AI agents running inside a Kubernetes cluster to MCP servers running in the Kubernetes cluster or outside of the Kubernetes cluster. By default, an AI agent can call initialize, notifications/initialized and tools/list. To enforce a "zero trust" security posture, a tools/call is denied unless it is allowed through the Tool Auth API described in this proposal.

@zhaohuabing commented Nov 5, 2025

Based on the Core Capabilities description, it looks like the ingress use case is also in the scope for this SIG.

Should we include authorization for ingress in this proposal as well?

The use case I have in mind is when MCP backends are running inside the Kubernetes cluster, and external agents invoke tool calls through the ingress gateway, where authorization should be enforced based on the identity in the request. For example, the scopes in the access token recommended by the MCP authorization spec.

Source Source `json:"source"`
// Tools specifies a list of tools.
// +optional
Tools []string `json:"tools,omitempty"`

@guicassolato commented Nov 5, 2025

I've shared my thoughts about this proposed API in the design doc already, but now feel like this is the place to put it on record again.

I strongly believe this may be going in the wrong direction. This API, which consists of listing resources (today tools, but soon likely resources and prompts too) and principals all in an AuthPolicy CR, has a few problems that IMO should not be overlooked:

  1. it doesn't scale well for large number of resources and principals;
  2. it copies very specific existing approaches, being heavily influenced by how one implementation in particular would probably support it underneath, rather than focusing on UX first;
  3. it oversteps on Kubernetes RBAC as the means of storing authorisation data.


#### Protocol-Aware Authorization for MCP Tools

As an AI Engineer, I want to create authorization policies to specify which individual tools (e.g., getWeather, sendEmail) my agent is permitted to call on an allow-listed MCP server, so that I can enforce least-privilege access at the specific tool-function level, not just the network endpoint.

Following on from a comment discussion in the original Google Doc: how do you feel about a user journey for filtering the list of tools returned from a tools/list call, so that tokens and time are not wasted trying to call a tool that a user doesn't have access to?

Member

I'm +1 to that; that's a good idea. What would the implementation look like?

I'm good with having a TODO here, and with it being explored in a prototype.
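
As a starting point for that prototype, here is a minimal sketch of the filtering idea raised in this thread: the gateway trims a tools/list result down to the tools the caller is allowed to invoke. The function name `filterTools` and the shape of the `allowed` set are assumptions for illustration only; nothing here is defined by the proposal.

```go
package main

import "fmt"

// filterTools returns only the advertised tools that appear in the caller's
// allowed set, preserving the order returned by the MCP server.
// Both the name and the signature are hypothetical.
func filterTools(advertised []string, allowed map[string]bool) []string {
	visible := make([]string, 0, len(advertised))
	for _, name := range advertised {
		if allowed[name] {
			visible = append(visible, name)
		}
	}
	return visible
}

func main() {
	// Tools the MCP server advertises in its tools/list response.
	advertised := []string{"getWeather", "sendEmail", "deleteUser"}
	// Tools this particular agent may call (assumed to be resolved from policy).
	allowed := map[string]bool{"getWeather": true, "sendEmail": true}

	fmt.Println(filterTools(advertised, allowed))
}
```

Filtering the response does not replace enforcement on tools/call; it only avoids advertising tools that would be denied anyway.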

@guicassolato left a comment

Blocking this for a while so we have a chance to work further on the details of the API proposal. I believe we need some changes, even at this early stage, to ensure this works across multiple implementations and to make sure we get early feedback.

// when representing Kubernetes workload identities.
//
// +optional
Identities []string `json:"identities,omitempty"`

@maia-iyer commented Nov 5, 2025

Chiming in from SPIFFE over here (if I'm missing context, let me know; I also commented this on the other PR and will bring that information over if valid). I can imagine that Identities can be useful for cross-cluster cases or for more granular expressivity than the ServiceAccounts below.

However, thinking more about it, for the cross-cluster case, I have a problem with how one can establish trust with the foreign trust domain and actually verify the identity properly. In addition to the Identity string, there would need to be (either here or somewhere else) some information about that foreign trust domain. We cannot simply trust that the extracted name is correct; we need to validate it against some out-of-band knowledge. Where is that information supposed to come from?

Author

The proposal targets egress traffic. The identities should be for pods running on the same Kubernetes cluster as the one where the AuthPolicy resource is installed.

If the proposal is for egress, we should call this out at the top of the document, especially because it's either that or "zero trust".
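
To make the same-cluster restriction discussed in this thread concrete, here is a minimal sketch of rejecting identities from any trust domain other than the local one. The function names and the `cluster.local` trust domain are assumptions for illustration. The sketch only inspects the identity string; as noted above, actually trusting an identity also requires verifying the presented credential against out-of-band knowledge, which this code does not attempt.

```go
package main

import (
	"fmt"
	"net/url"
)

// trustDomainOf extracts the trust domain (the URI host) from a SPIFFE ID
// such as spiffe://cluster.local/ns/default/sa/agent.
func trustDomainOf(id string) (string, error) {
	u, err := url.Parse(id)
	if err != nil {
		return "", err
	}
	if u.Scheme != "spiffe" || u.Host == "" {
		return "", fmt.Errorf("not a SPIFFE ID: %q", id)
	}
	return u.Host, nil
}

// isLocalIdentity reports whether id belongs to the local trust domain.
// It is a string-level check only (hypothetical helper, not a proposal API).
func isLocalIdentity(id, localTrustDomain string) bool {
	td, err := trustDomainOf(id)
	return err == nil && td == localTrustDomain
}

func main() {
	const local = "cluster.local" // assumed local trust domain
	for _, id := range []string{
		"spiffe://cluster.local/ns/default/sa/agent",
		"spiffe://other.example/ns/default/sa/agent",
	} {
		fmt.Println(id, isLocalIdentity(id, local))
	}
}
```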


#### Agent Identity

As an AI Engineer, I want to assign a unique, verifiable identity to my agent running in Kubernetes, so that gateways or external systems can securely authenticate it and make authorization decisions.

This user journey doesn't seem to be directly covered by this proposal. Or is it?

It seems more to suggest that, because permissions can only be declared for Service Accounts and SPIFFE IDs, those should implicitly be taken as the ways to represent identity.

I see "assigning a unique, verifiable identity to an agent" as incompatible with authentication being a non-goal.

Type *BackendType `json:"type"`
// MCP defines a MCP backend.
// +optional
MCP MCPBackend `json:"mcp,omitempty"`
Member

This Backend API is representative of egress gateways, a use case that transcends agentic, and AI in general. Having MCPBackend makes this more specifically an agentic AI API.

I expect this capability belongs in Gateway API itself. If we move forward with this API here, how do we see threading the needle with all the other use cases and stakeholders for this?

Author

@shaneutt, do you think we should have a GEP for the Backend CRD introduced in this PR?

cc @robscott

FWIW I had planned to base the Backend in this proposal on what you have here in order to avoid repeating work. A GEP would make my life a bit easier in that regard.

+1 to a GEP

Member

Echoing feedback and discussions we had during the KubeCon Gateway API break-room sessions.

We agreed that Backend probably belongs in Gateway API. We also agreed that agentic networking will experiment with this at its own pace and will move it to a GEP once we have something we feel is implementable.

As Shane also said, it will likely require collaboration with other WGs and Gateway API to make sure it's the "right" backend that fits more generic needs, but we should not hold up iteration and prototyping at this point in the project.

In agentic networking we have three primary backends: LLMs (cloud-hosted or on-prem), MCP servers, and agents, since an agent can call LLMs, tools, or other agents. In the Envoy ecosystem, we have an alpha version of Backend implemented in the Envoy Gateway project, and in the Envoy AI Gateway project we introduced AIServiceBackend for LLM egress cases. I think there are two options:

  1. Add a generic Backend in Gateway API and here we define AgenticBackend to have a reference to Backend
  2. Composite backend like the way defined here, but allow defining MCP, LLM, Agent specific backend fields.

I think prototyping and iterating on a single CRD with a type field would be quicker than introducing additional CRDs at this stage.
So, option 2 is my recommendation.
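
For reference, option 2 (one CRD with a type discriminator plus per-type blocks) could look roughly like the sketch below. Only `BackendType` and `MCPBackend` are names taken from the PR; the `LLM` variant, the constant values, the pointer fields, and the `validate` helper are assumptions made for illustration and differ from the field shapes currently in the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// BackendType discriminates which type-specific block is expected.
type BackendType string

const (
	BackendTypeMCP BackendType = "MCP" // value assumed for this sketch
	BackendTypeLLM BackendType = "LLM" // hypothetical future variant
)

// MCPBackend holds MCP-specific settings.
type MCPBackend struct {
	Path string
}

// LLMBackend is a hypothetical block for LLM-specific settings.
type LLMBackend struct {
	Model string
}

// BackendSpec is a composite spec: Type selects exactly one block.
type BackendSpec struct {
	Type BackendType
	MCP  *MCPBackend
	LLM  *LLMBackend
}

// validate checks that exactly one block is set and that it matches Type.
func validate(spec BackendSpec) error {
	set := 0
	if spec.MCP != nil {
		set++
	}
	if spec.LLM != nil {
		set++
	}
	if set != 1 {
		return errors.New("exactly one type-specific block must be set")
	}
	switch spec.Type {
	case BackendTypeMCP:
		if spec.MCP == nil {
			return errors.New("type MCP requires the mcp block")
		}
	case BackendTypeLLM:
		if spec.LLM == nil {
			return errors.New("type LLM requires the llm block")
		}
	default:
		return fmt.Errorf("unknown backend type %q", spec.Type)
	}
	return nil
}

func main() {
	ok := BackendSpec{Type: BackendTypeMCP, MCP: &MCPBackend{Path: "/mcp"}}
	bad := BackendSpec{Type: BackendTypeLLM, MCP: &MCPBackend{Path: "/mcp"}}
	fmt.Println(validate(ok))
	fmt.Println(validate(bad))
}
```

In a real CRD this kind of cross-field rule would more likely be expressed as a validation rule on the schema than as Go code; the function is only meant to show the invariant.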

@k8s-ci-robot added the `size/XL` label (Denotes a PR that changes 500-999 lines, ignoring generated files.) and removed the `size/L` label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Nov 10, 2025
@haiyanmeng force-pushed the toolauthapi branch 6 times, most recently from 32a24db to 3b898c3, on November 11, 2025 22:01

const (
// ActionAllow allows requests that match the policy rules.
ActionAllow AuthPolicyAction = "ALLOW"

Deny lists can be convenient in some use cases but are also a bad practice security-wise. IMO, we should not encourage it.

My recommendation is to drop this field altogether and support only allow lists at the beginning.

I agree with the sentiment, but I think this structure is more extensible if we do want to add a deny-list mode in the future (compared to having to introduce a new field and deal with conflicts).

AFAIAC, adding a new field generates no conflict, as long as it's an optional field and it defaults to ALLOW. However, I'm more interested in why we would want to add deny list mode in the future. That would be a hard decision to say the least, picking convenience over security. Or perhaps you're envisioning some other feature @keithmattix?

Fair point on the new field; not as difficult as I originally thought.

Re: denylist - I think it's misleading to think of all security best practices in absolutes, especially in API design. A deny list, while not ideal, is often the best certain organizations can do at a given moment, and it's better for them to have some traffic paths blocked vs. leaving everything open until every known good path can be ascertained. Organizations won't risk an outage to adopt a default-deny posture before they're ready.

Comment on lines +516 to +525
- apiGroups: ["agentic.networking.x-k8s.io"]
  resources: ["backends"]
  resourceNames: ["mcp-server2"]
  verbs: ["read_wiki_structure"]

Even though this would work, the resource doesn't have to be a kind known by the API server. In this particular example, tools/call is implicit. Another option could be:

- apiGroups: ["agentic.networking.x-k8s.io"]
  resources: ["mcptools"]
  resourceNames: ["mcp-server2/read_wiki_structure"]
  verbs: ["call"]
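
To compare the two encodings side by side, here is a small sketch that maps one MCP tool call onto the RBAC attributes each rule shape would be checked against. The `rbacAttrs` type and both function names are hypothetical; the sketch does not call the Kubernetes API, it only shows which field carries the tool name in each variant.

```go
package main

import "fmt"

// rbacAttrs mirrors the fields of an RBAC rule that an access check is
// matched against (hypothetical type for this sketch).
type rbacAttrs struct {
	APIGroup     string
	Resource     string
	ResourceName string
	Verb         string
}

const apiGroup = "agentic.networking.x-k8s.io"

// toolAsVerb follows the first shape: the Backend is the resource and the
// tool name is used as the verb.
func toolAsVerb(backend, tool string) rbacAttrs {
	return rbacAttrs{APIGroup: apiGroup, Resource: "backends", ResourceName: backend, Verb: tool}
}

// toolAsResource follows the alternative shape: a virtual "mcptools" resource
// named <backend>/<tool>, with an explicit "call" verb.
func toolAsResource(backend, tool string) rbacAttrs {
	return rbacAttrs{APIGroup: apiGroup, Resource: "mcptools", ResourceName: backend + "/" + tool, Verb: "call"}
}

func main() {
	fmt.Printf("%+v\n", toolAsVerb("mcp-server2", "read_wiki_structure"))
	fmt.Printf("%+v\n", toolAsResource("mcp-server2", "read_wiki_structure"))
}
```

One trade-off visible here: the second shape leaves the verb free for other MCP operations later, while the first keeps `resourceNames` pointing at a real Backend object.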


#### **Latency (The `ext_authz` Hop)**

Because it relies on Envoy's *external authorization* API, every request that hits your Gateway must pause, make a network hop to the Authorino service, wait for a decision, and then resume.

Sorry. I know this is not obvious at a glance; it is hidden as an implementation detail of Kuadrant. Although Authorino does implement Envoy ext_authz and could be integrated directly from an HTTPRoute, when configured via a Kuadrant AuthPolicy it does not use the Envoy ext_authz filter. Instead, Kuadrant makes the call (or not) to Authorino from a wasm module that runs in the same process as the proxy itself. The extra hop is therefore not always needed: the latency cost of the hop is real, but it is not paid on "every request".

This proposal aims to address the user journeys described in kubernetes-sigs#5, except for the AuthScheme CRD.

The AuthScheme CRD can be added in a followup PR.
We will address these two areas in a follow-up PR

The CRD names may change depending on the OSS feedback.

> **_NOTE:_** The API does not cover identity extraction, or request authentication. We will cover them in a follow-up Pull Request.
Author

@guicassolato, I incorporated all your suggestions from haiyanmeng#1, except for the AuthScheme CRD, into this PR, and added this note.

After this PR is merged, you can create a PR against https://github.com/kubernetes-sigs/kube-agentic-networking for the AuthScheme CRD. What do you think?

I'd prefer to keep both APIs in a unified proposal as we originally discussed. Here's my reasoning: The two APIs address different use cases and architectural patterns. I believe presenting them together allows implementations to evaluate which approach (or both) fits their needs, rather than prescribing one path forward.

Specific concerns with AccessPolicy as a standalone solution: For dynamic environments (like MCP with Dynamic Client Registration and OAuth2), the AccessPolicy pattern requires constantly reconciling a potentially large policy resource as principals change. This creates scalability challenges with:

  • Large policy objects being frequently updated
  • Potential conflicts during concurrent reconciliation
  • Storage/etcd pressure for environments with many dynamic principals

Why AuthScheme complements this: AuthScheme addresses these concerns by delegating to external trust sources (OIDC providers, Kubernetes RBAC) rather than storing all permission data in-cluster. The CEL-based extraction and verification model supports:

  • Service Account token validation for Kube-native apps
  • OIDC federation for dynamic client scenarios
  • Kubernetes RBAC for environments preferring existing authorization mechanisms
  • In-policy pattern-matching authorisation rules for things like JWT claim checks

I see value in both approaches for different contexts, which is why I'd advocate for landing them together. That said, I recognize this is a community decision. Is there a specific concern about including AuthScheme in this PR that we could address?

Author

@guicassolato, thanks for sharing your thoughts. I do see the value of having an AuthScheme resource. My main reasons for introducing AuthScheme in a follow-up PR are:

  • At this point the community seems okay with Backend and AccessPolicy (formerly named AuthPolicy). However, AuthScheme is new, and it may take some time for the community to reach an agreement on its design details.
  • Putting AuthScheme in a separate PR allows higher velocity. If there is feedback from the community to address, you can address it directly without going through https://github.com/haiyanmeng/kube-agentic-networking.

Member

@LiorLieberman commented Nov 18, 2025

> This creates scalability challenges with:

I don't think we should be bothered with scalability at this stage. We are in a prototyping phase. I would like to understand the potential-conflicts argument better, though.

AuthScheme, while robust, is a very opinionated way of doing something, and it's complex. I do want to have more discussion on that in a wider forum.

I think it's a win that we have some consensus on "Backend" and "AccessPolicy" (and thanks for the awesome changes you suggested), but my real concern is that we would be starting with something very complex. I'd rather tackle things one by one.

Backend is already a new type folks need to work with, and AccessPolicy is another one.

Anyway, I guess the summary of what I'm saying is: I do understand AuthScheme's place, but I would love to have it as a fast follow, so we can discuss it and see exactly which user flows we are covering and how. We spent a good amount of time discussing Backend and AccessPolicy (formerly ToolAuthPolicy) to gather feedback, but we haven't done this for AuthScheme. And I do think it's important to allow a UX where users don't need an AuthScheme; rather, they can use it (or some other mechanism, TBD based on discussions and feedback) to customize or enhance what they need.

👋 I've been lurking for a while with some ideas, but I just carved out some time to read the entire proposal and comments, so apologies for popping in randomly!

As a quick middle-ground proposal for the authN question: instead of defining the whole extraction flow in this PR, we could just define the canonical identity formats that AccessPolicy should match, similar to how fine-grained authorization systems like Zanzibar handle identities from different sources.

e.g., these three can all refer to the same workload:

  • spiffe://cluster.local/ns/default/sa/agent

  • { namespace: "default", name: "agent" }

  • { iss: "https://issuer.example.com", sub: "agent@default" }

which can be normalised to something like: sa:default/agent

So we could define formats such as sa:<ns>/<name>, spiffe:<uri>, oidc:<issuer>/<sub>, external:<id>, and leave the details of how those identities get extracted to a follow-up proposal. Without this, two implementations could evaluate AccessPolicy against different identity strings for the same principal, causing potential consistency violations and policy drift.
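
A rough sketch of what such normalization could look like follows. The function names are hypothetical, and two choices are assumptions of this sketch rather than part of the suggestion above: a SPIFFE ID of the shape spiffe://&lt;trust-domain&gt;/ns/&lt;ns&gt;/sa/&lt;name&gt; is collapsed to the sa: form (which silently drops the trust domain, so it is only safe for the local one), and any other SPIFFE ID keeps its full URI after the spiffe: prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeServiceAccount renders a ServiceAccount reference as sa:<ns>/<name>.
func normalizeServiceAccount(namespace, name string) string {
	return "sa:" + namespace + "/" + name
}

// normalizeOIDC renders an OIDC principal as oidc:<issuer>/<sub>.
func normalizeOIDC(issuer, subject string) string {
	return "oidc:" + issuer + "/" + subject
}

// normalizeSPIFFE maps spiffe://<td>/ns/<ns>/sa/<name> to sa:<ns>/<name>.
// Any other SPIFFE ID is kept as spiffe:<uri> (an assumption of this sketch).
func normalizeSPIFFE(id string) (string, error) {
	if !strings.HasPrefix(id, "spiffe://") {
		return "", fmt.Errorf("not a SPIFFE ID: %q", id)
	}
	// parts is [trustDomain, "ns", <ns>, "sa", <name>] for workload IDs.
	parts := strings.Split(strings.TrimPrefix(id, "spiffe://"), "/")
	if len(parts) == 5 && parts[1] == "ns" && parts[3] == "sa" && parts[2] != "" && parts[4] != "" {
		return normalizeServiceAccount(parts[2], parts[4]), nil
	}
	return "spiffe:" + id, nil
}

func main() {
	fromSPIFFE, err := normalizeSPIFFE("spiffe://cluster.local/ns/default/sa/agent")
	if err != nil {
		panic(err)
	}
	fromSA := normalizeServiceAccount("default", "agent")
	// Both sources should yield the same canonical string for the same workload.
	fmt.Println(fromSPIFFE, fromSA, fromSPIFFE == fromSA)
	fmt.Println(normalizeOIDC("https://issuer.example.com", "agent@default"))
}
```

Mapping an OIDC subject such as agent@default onto the sa: form would need issuer-specific rules, so the sketch leaves OIDC principals in their own namespace.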

Please let's discuss the AuthScheme in haiyanmeng#1. I'm very interested in your points regarding it being opinionated and complex @LiorLieberman.

Comment on lines +95 to +108
ServiceName string `json:"serviceName,omitempty"`
// Hostname defines the hostname of the external MCP service to connect to.
// +optional
Hostname string `json:"hostname,omitempty"`
// Port defines the port of the backend endpoint.
// +required
Port int32 `json:"port"`
// Path is the URL path of the MCP backend for MCP traffic.
// A MCP backend may serve both MCP traffic and non-MCP traffic.
// If not specified, the default is /mcp.
// +optional
// +kubebuilder:default:=/mcp
Path string `json:"path,omitempty"`
}
These fields are not MCP-specific; they can be used by other agentic backends such as LLMs or agents.

So move them into BackendSpec?
We can abstract when needed, as MCP specifics arise.

Maybe something like MCP protocol version could be useful, but I'd prefer to have a concrete use case.

}

// AccessPolicySpec defines the desired state of AccessPolicy.
type AccessPolicySpec struct {
This looks like an authorization policy, which could be solved by ext_authz using OPA. Isn't that far more expressive and flexible than defining a Kubernetes resource here?


This proposal defines authorization policies for tool access from AI agents running inside a Kubernetes cluster to MCP servers running in the Kubernetes cluster or outside of the Kubernetes cluster. By default, an AI agent can call initialize, notifications/initialized and tools/list. To enforce a "zero trust" security posture, a tools/call is denied unless it is allowed through the Tool Auth API described in this proposal.
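
As an illustration of the default behaviour described in that paragraph, the following sketch allows the three default MCP methods unconditionally and denies tools/call unless the tool is explicitly allowed for the calling agent. The `allowedTools` type and the `authorize` function are hypothetical stand-ins for whatever an implementation resolves from policy, and denying every other method is an assumption of the sketch.

```go
package main

import "fmt"

// defaultAllowedMethods lists the MCP methods an agent may call without any
// policy, per the proposal text.
var defaultAllowedMethods = map[string]bool{
	"initialize":                true,
	"notifications/initialized": true,
	"tools/list":                true,
}

// allowedTools is a hypothetical, already-resolved view of policy:
// agent identity -> set of tool names that agent may call.
type allowedTools map[string]map[string]bool

// authorize reports whether the request is permitted. tools/call is denied
// unless the tool is explicitly allowed for the identity; methods not covered
// by the proposal text are denied (an assumption of this sketch).
func authorize(policy allowedTools, identity, method, tool string) bool {
	if defaultAllowedMethods[method] {
		return true
	}
	if method == "tools/call" {
		return policy[identity][tool]
	}
	return false
}

func main() {
	policy := allowedTools{
		"agent-a": {"getWeather": true},
	}
	fmt.Println(authorize(policy, "agent-a", "tools/list", ""))
	fmt.Println(authorize(policy, "agent-a", "tools/call", "getWeather"))
	fmt.Println(authorize(policy, "agent-a", "tools/call", "sendEmail"))
}
```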

# ⚠️ Warning: Experimental API

Suggested change
# ⚠️ Warning: Experimental API
# 🚫🚫 **STOP – EXPERIMENTAL API** 🚫🚫
**Do NOT use this in production.**

Making it more obvious not to use this yet.
