Skip to content

feat(tencent): add Tencent Cloud provider#165

Merged
rafeegnash merged 13 commits into
bgdnvk:masterfrom
rephapeng:pr/tencent-provider
May 25, 2026
Merged

feat(tencent): add Tencent Cloud provider#165
rafeegnash merged 13 commits into
bgdnvk:masterfrom
rephapeng:pr/tencent-provider

Conversation

@rephapeng
Copy link
Copy Markdown
Contributor

@rephapeng rephapeng commented May 18, 2026

Summary

Adds a Tencent Cloud provider alongside the existing AWS / GCP / Azure / Cloudflare / Fly / Verda / Vercel / Railway providers. Built over the past month in 15 phases on my fork; this PR consolidates the work as one contribution.

Opened as Draft so you can react before committing review time. Happy to split into multiple PRs (provider core / HTTP API / Maker integration) if that's preferred.

Coverage

  • Computecvm + lighthouse (lightweight cloud server)
  • Networkvpc, subnet, security-group + rule audit, eip, clb, nat, vpn, ccn, direct-connect
  • Storagecbs (Cloud Block Storage), cos (Object Storage)
  • Databasemysql (cdb), postgres, redis, mongodb, cynosdb (tdsql-c)
  • Container — TKE clusters + kubeconfig fetch
  • Edgecdn, edgeone, waf, anti-ddos
  • Identitycam users
  • Observability — Cloud Monitor metrics (CVM + Lighthouse), CLS log topics, Cloud Audit tracks, alarm policies
  • Cost — monthly billing by product + top-N resources
  • Tags — surfaced on summary structs via a reflection helper that handles the SDK's inconsistent tag-field names (Tag{Key,Value}, ResourceTag{TagKey,TagValue}, TagSet, TagList, ...)

Security audits (10)

public-cvm-exposure, clb-exposure, db-exposure, idle-eips, unencrypted-cbs, cert-expiry, cam-hygiene, waf-coverage, antiddos-coverage, audit-coverage.

Maker integration

  • tencent-api verb — 5-arg [tencent-api, service, action, region, json-params] dispatches via Tencent SDK's CommonRequest over signed transport. No tccli dependency.
  • filter verb (new)[filter, sourceIdx, arrayPath, field, op, value] post-processor. Operators: > < >= <= == != contains startsWith matches. Lets Maker answer "find X by criteria" queries directly instead of returning full inventory.
  • [*] array placeholdersjsonPathString in internal/maker/exec.go now handles $.X[*].Y wildcard paths, binding the placeholder to a JSON array literal so chains like "InstanceIds":<CVM_IDS> work.
  • Action denylistknownHallucinatedActions catches common LLM-invented action names with "did you mean..." hints before round-tripping to Tencent.
  • Prompt — chain shapes A–H, anti-patterns, static-spec vs runtime metrics rules.

HTTP API

Hooks into the existing clanker server route table with bearer-auth endpoints under /api/v1/tencent/* (inventory, scans, monitoring, cost, topology). Maker plan/apply share the existing endpoints.

Credentials

Resolved in this order:

  1. viper tencent.{secret_id,secret_key,region}
  2. TENCENTCLOUD_SECRET_{ID,KEY} + TENCENTCLOUD_REGION (official Tencent SDK env names)
  3. TENCENT_SECRET_{ID,KEY} + TENCENT_REGION (short aliases)

Default region: ap-singapore.

Stats

50 files changed, 9112 insertions(+), 4 deletions(-) — almost entirely additive. Modifications are limited to: internal/maker/exec.go ([*] helper), internal/maker/exec_tencent_filter.go (new filter verb used by other providers too), and registration in cmd/{ask,root,server}.go.

Test plan

  • Build green on Go 1.25
  • Smoke: clanker tencent cvm --region ap-singapore with valid creds
  • HTTP: GET /api/v1/tencent/resources/cvm?region=ap-singapore returns inventory
  • Maker: clanker ask --maker "list CVMs with >2 vCPUs in ap-jakarta" produces a 2-cmd plan (DescribeInstances + filter)
  • Security scan: clanker tencent scan public-exposure --region ap-singapore

@rephapeng rephapeng marked this pull request as ready for review May 18, 2026 14:23
@bgdnvk
Copy link
Copy Markdown
Owner

bgdnvk commented May 23, 2026

Thanks for the PR, will review soon :)

Copy link
Copy Markdown
Collaborator

@rafeegnash rafeegnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the huge body of work here — 10K lines across 15 phases is real engineering, and the Tencent provider integration is in great shape overall. The lazy per-service SDK clients, the tencent-api + filter maker verbs, the knownHallucinatedActions map, the credential resolution chain — all thoughtful and a clean fit with the codebase's conventions.

We'd love to take you up on your offer to split into smaller PRs — at this scale, three focused PRs would make this much easier to review carefully and give you faster turnaround on the parts that are already solid:

  1. Provider coreinternal/tencent/* (the SDK client, per-service modules, security scans, Cobra commands, credential resolution). This is the strongest piece and could land first with minimal back-and-forth.
  2. Maker integrationinternal/maker/exec_tencent*.go, tencent_prompts.go, the [*] placeholder change in exec.go. Most of our review notes are here.
  3. HTTP APIinternal/api/* plus the cmd/server.go wiring. This is the layer that needs the most careful auth hardening (see notes below).

Totally fine if splitting isn't realistic — we can keep reviewing this PR as-is, just wanted to flag that the offer is welcome.

A few things to flag before re-review:

Branch is out of date

The merge base is 3e65a24 (feat: mcp #164, from late April). Since then master has merged #169#174 (k8s helm, exec/apply/port-forward, node lifecycle, create aks, native MCP K8s tools, conversation history cap, SRE playbook framework). The diff currently shows those files as deletions — that's purely a rebase-conflict artifact, not anything you intended. When you get a chance, could you pull master and rebase? The conflict surface should mostly be cmd/root.go and cmd/ask.go where the new K8s subcommands register alongside what you've added.

Security notes on the HTTP API

These are the items we'd want fixed before the HTTP API piece lands (whether as part of this PR or in the split):

  • internal/api/middleware.go:43 — token comparison should use crypto/subtle.ConstantTimeCompare to avoid timing-attack token recovery. Two-line fix, see inline comment.
  • internal/api/server.go:67-69 — empty --token currently disables auth on every route including POST /api/v1/maker/apply. It only logs a warning, and the first help-text example shows running with no token, which feels like a footgun. Could we either refuse to start without a token (unless --insecure is explicit), or at least gate POST /maker/apply when the token is unset?
  • internal/maker/exec_tencent_filter.go:182regexp.Compile(value) on user-supplied patterns has a ReDoS risk ((a+)+$ style patterns can spin a goroutine). A pattern-length cap + per-regex deadline (or regexp2 with timeout) would close this.

High-severity items worth addressing

  • internal/api/routes.go:273 — the ?region= query param goes verbatim into the SDK client. Pinned to *.tencentcloudapi.com so not full SSRF, but enables enumeration and potentially unintended API charges. A simple regex validation (^ap-[a-z]+-[0-9]*$) or matching against ListAllRegions would tighten this.
  • internal/api/server.go:48CORSOrigin defaults to *. Bearer auth via header mitigates CSRF, but a http://localhost default with explicit override feels safer.
  • internal/tencent/raw.go — the destructive denylist is prefix-based (Terminate|Delete|Destroy|Reset|Release|Discontinue), so CAM actions like AddUser, CreateAccessKey, AttachUserPolicy slip through without needing --destroyer. If LLM prompts could ever be influenced by fetched data, this widens the blast radius. Would you be open to an allowlist of read-only actions for non-destroyer mode?

Should-fix before merge

  • internal/tencent/context.goctx is accepted on the gather functions but the Tencent SDK has no WithContext variants, so the context is effectively dead. Ctrl-C during clanker ask --tencent won't interrupt anything in-flight. Either a comment explaining the limitation or wiring HttpProfile.ReqTimeout per-call would help.
  • internal/tencent/context.goGetRelevantContext hardcodes Limit = 100 and silently truncates. Accounts with more than 100 of any resource type send incomplete data to the LLM. The CLI list commands paginate correctly — could the context-gather layer do the same, or at least log (showing first 100) like the AWS path does?
  • internal/maker/exec_tencent_filter.govalidateFilterCommand and filterMatch are non-trivial decision-tree code with no tests. A table-driven test would help guard against regressions.

Nits / parity notes

  • internal/tencent/client.go lacks a NewClientWithBackendCredentials factory like other providers have — fine for now since Tencent isn't wired into the backend credential flow yet, but worth adding for consistency.
  • internal/api/routes.go:104-128 calls GetRelevantContext, discards the result, then re-calls gatherTencentByType — doubled API calls per request.
  • Credentials struct (client.go:18) has no String() / MarshalJSON redaction. AWS has the same gap, but one %+v away from leaking SecretKey.
  • internal/tencent/raw.go:80paramsJSON is unmarshalled with no per-field size cap. Effective cap is the 1 MiB body limit; 64 KiB would be plenty.
  • No MCP tools for Tencent — looks intentional for this PR's scope; mentioning so reviewers don't wonder.

What we really like

  • Lazy per-service SDK clients match the Tencent SDK's design — clearly justified divergence from AWS's eager init.
  • Generated files have the correct Code generated by ...; DO NOT EDIT. headers and the generator uses //go:build ignore.
  • The reflection-based tag extractor in tags.go is bounded — no recursion, no panic risk on unusual SDK shapes.
  • The --destroyer gate follows the existing maker convention.
  • Security audits use only describe APIs — no port scanning of discovered endpoints.
  • knownHallucinatedActions with "did you mean" hints is a nice UX touch for LLM-typo'd actions.

Happy to re-review piece by piece as the splits land (or this PR after a rebase), whichever is easier for you. Thanks again for taking this on.

Comment thread internal/api/middleware.go Outdated
}
auth := r.Header.Get("Authorization")
const prefix = "Bearer "
if !strings.HasPrefix(auth, prefix) || strings.TrimSpace(auth[len(prefix):]) != s.cfg.Token {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical — non-constant-time token comparison.

This is a timing side-channel an attacker can use to recover the token byte-by-byte by measuring response times across many requests. Practical over LAN, sometimes even over the open internet.

Suggested fix:

import "crypto/subtle"

// ...
provided := []byte(strings.TrimSpace(auth[len(prefix):]))
expected := []byte(s.cfg.Token)
if subtle.ConstantTimeCompare(provided, expected) != 1 {
    writeError(w, http.StatusUnauthorized, "unauthorized", "missing or invalid bearer token")
    return
}

(ConstantTimeCompare returns 0 if the lengths differ as well, so this also covers the length-equal-but-content-mismatch case.)

Comment thread internal/api/server.go
// Run starts the HTTP server and blocks until ctx is cancelled or
// ListenAndServe returns an error.
func (s *Server) Run(ctx context.Context) error {
if strings.TrimSpace(s.cfg.Token) == "" {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical — empty token disables auth on every route, including POST /maker/apply.

With cfg.Token == "" the middleware short-circuits and lets every request through. That includes plan execution against live Tencent credentials. The first example in the help text shows clanker server --port 8080 with no token, so this is what operators will run by default in dev — and a malicious page loaded on the same machine could then call POST /maker/apply cross-origin (CORS defaults to *).

Could we either:

  • Refuse to start without a token unless an explicit --insecure flag is passed, or
  • At minimum, block POST /maker/apply (and any other mutating route) when the token is unset, even if read-only routes stay open?

The one-line WARNING log is easy to miss in a busy terminal.

if !ok {
return false
}
re, err := regexp.Compile(value)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical — ReDoS via user-supplied regex.

value comes from the plan JSON, which in the HTTP path comes from the POST /api/v1/maker/apply body. A pattern like (a+)+$ against a long string will spin a goroutine that context.Done() cannot preempt — Go's regexp package doesn't support deadline-based cancellation.

Options:

  1. Pattern-length cap + complexity heuristic — reject patterns over ~256 chars or containing nested quantifiers ((a+)+, (a*)*, (a|a)*).
  2. regexp2 package — drop-in replacement with MatchTimeout support.
  3. Pre-validate via regexp/syntax.Parse and walk the AST rejecting unbounded backtracking.

Option 1 is the smallest change. Option 2 is most robust if you're OK adding a dep.

At minimum, bound the string length being matched against — if s is a huge SDK response field, the worst-case input grows.

Comment thread internal/api/routes.go
// tencentClient builds a Tencent client for this request. The region can be
// overridden per request via ?region=ap-jakarta; otherwise the daemon's
// default (resolved from config / env at startup) is used.
func (s *Server) tencentClient(r *http.Request) (*tencent.Client, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — ?region= query param accepted without validation.

The value is passed verbatim into the Tencent SDK as the region, which becomes part of the request signing and the endpoint hostname. The hostname is pinned to *.tencentcloudapi.com by the SDK so it's not arbitrary SSRF, but a caller can:

  • Enumerate which regions the credential has access to via differential responses.
  • Cause unintended API charges by looping through bogus region strings.

Validate against ^[a-z]+-[a-z]+(-[0-9]+)?$ or check membership in ListAllRegions(). Cheap to add.

Comment thread internal/api/server.go Outdated
cfg.Addr = ":8080"
}
if cfg.CORSOrigin == "" {
cfg.CORSOrigin = "*"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — CORS default is *.

Bearer auth via Authorization header mitigates classic CSRF (browsers won't auto-send), but a wildcard default still feels too permissive — any origin can read responses on a successful auth, and combined with the empty-token case above, an unset token on a clanker server instance is reachable from any web page.

Suggest defaulting to http://localhost and requiring an explicit --cors-origin to relax it.

@@ -0,0 +1,1467 @@
package tencent
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix — ctx parameter is accepted but never propagated to the Tencent SDK.

The SDK's generated clients don't expose WithContext variants, so passing ctx here is purely cosmetic — a Ctrl-C during clanker ask --tencent won't interrupt anything in flight. The only function that respects the context is contextCOS via its own context.WithTimeout (which would still wait on the SDK call to return).

Two options:

  1. Document the limitation at the top of this file so the next person doesn't expect cancellation to work.
  2. Wire HttpProfile.ReqTimeout on each per-service profile.NewClientProfile() so calls at least have a hard ceiling.

Option 2 is the safer choice — without it a hung Tencent endpoint blocks the gather goroutine indefinitely.

return out.String(), nil
}

// contextCVMs returns a compact JSON array of CVMs in this client's region.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix — silent truncation at Limit = 100.

The gather functions cap at 100 items per resource type and don't paginate. An account with 150 CVMs sees only the first 100 in the LLM's context, with no signal that more exist. The CLI list commands (e.g. internal/tencent/cvm.go) paginate correctly with an offset loop, so the pattern exists in the codebase.

At minimum, surface the truncation: AWS does this with (showing first %d roles) in the error/info path. Ideally, paginate fully (or to a higher cap like 500) so the LLM gets the full picture for at least small-to-medium accounts.

const defaultRegion = "ap-singapore"

// Credentials holds the resolved Tencent Cloud credentials and target region.
type Credentials struct {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — credentials struct has no redaction.

No String() / MarshalJSON method, so any %+v or %v formatting of Credentials (or any struct that embeds it, like Client) leaks SecretKey into logs. No current call site does this, but it's one careless debug line away. AWS has the same gap, so this is parity — but worth tightening on both sides:

func (c Credentials) String() string {
    return fmt.Sprintf("Credentials{SecretID: %q, Region: %q, SecretKey: <redacted>}", c.SecretID, c.Region)
}

Comment thread internal/api/routes.go Outdated
"data": []interface{}{},
"warning": err.Error(),
})
_ = raw
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — GetRelevantContext result is discarded then gatherTencentByType is called separately, doubling the API calls.

The raw variable above is assigned and then the code path just continues to gatherTencentByType(...), which makes the same SDK calls a second time. Either consume raw here or drop the GetRelevantContext call entirely from this handler.

// validateTencentCommand rejects anything that isn't a well-formed tencent-api
// call. Destructive actions (Terminate*, Delete*, Reset*) are gated behind
// --destroyer to match the policy applied to every other provider.
func validateTencentCommand(args []string, allowDestructive bool) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix — no test coverage for validateTencentCommand or validateFilterCommand.

This is a security-relevant validator (rejects newlines, gates destructive actions, enforces arg-count bounds). A table-driven test covering each error path — empty command, wrong verb, missing service/action/region, newline in args, destructive without --destroyer, oversized arg count — would help prevent regressions. The same applies to filterMatch in exec_tencent_filter.go.

@rephapeng
Copy link
Copy Markdown
Contributor Author

cool thank you for all comment and fixing recommendation I will check it directly

rephapeng added 8 commits May 24, 2026 21:11
Adds a Tencent Cloud provider to the Clanker CLI alongside the existing
AWS/GCP/Azure/Cloudflare/Fly/Verda/Vercel/Railway providers. Built up
in 15 phases over the past month against the bgdnvk/clanker upstream;
this PR consolidates the full provider as one contribution.

Coverage

* Compute      cvm + lighthouse (lightweight cloud server)
* Network      vpc, subnet, security-group + rule audit, eip, clb,
               nat, vpn, ccn, direct-connect
* Storage      cbs (Cloud Block Storage), cos (Object Storage)
* Database     mysql (cdb), postgres, redis, mongodb, cynosdb (tdsql-c)
* Container    tke clusters + kubeconfig fetch
* Edge         cdn, edgeone, waf, anti-ddos
* Identity     cam users
* Observability  cloud monitor metrics (CVM + Lighthouse), cls log
                 topics, cloud audit tracks, alarm policies
* Cost         monthly billing by product + top-N resources
* Tags         flat map[string]string surfaced on all summary
               structs that the SDK returns tags for (CVM, Lighthouse,
               VPC, Postgres); reflection-based helper handles the
               SDK's inconsistent tag-field naming across services.

Security audits

* public-cvm-exposure   CVMs with sensitive ports open to 0.0.0.0/0
* clb-exposure          public CLBs with risky listeners
* db-exposure           managed DBs reachable from the public internet
* idle-eips             EIPs unbound but billed
* unencrypted-cbs       CBS volumes without encryption
* cert-expiry           SSL certs expiring within N days
* cam-hygiene           CAM users missing phone/email
* waf-coverage          CDN/EdgeOne hosts not covered by WAF
* antiddos-coverage     account anti-DDoS posture + per-region targets
* audit-coverage        Cloud Audit tracks status

HTTP API

Hooks into the existing `clanker server` route table with bearer-auth
endpoints for inventory (/api/v1/tencent/resources/{type}), scans
(/api/v1/tencent/scan/{kind}), monitoring (/api/v1/tencent/metrics/
{cvm|lighthouse}), cost (/api/v1/tencent/cost/by-product, /resources),
and topology (/api/v1/tencent/topology). Maker plan and apply share
the existing /api/v1/maker/{plan,apply} endpoints.

Maker integration

* tencent-api verb         5-arg form [tencent-api, service, action,
                           region, json-params] dispatches to a generic
                           SendRaw over Tencent's CommonRequest signed
                           transport. No tccli dependency.
* tencent_prompts.go       planner system prompt with chain shapes A-H,
                           anti-patterns, static-spec vs runtime
                           metrics rules, and the bgdnvk-style filter
                           verb example.
* filter verb              new [filter, sourceIdx, arrayPath, field,
                           op, value] post-processor returns the
                           matching subset of a prior command's output.
                           Operators: > < >= <= == != contains
                           startsWith matches. Lets Maker answer
                           "find X by criteria" queries directly
                           instead of dumping full inventory.
* [*] array placeholders   jsonPathString in internal/maker/exec.go
                           now handles $.X[*].Y wildcard paths,
                           binding to a JSON array literal so
                           "InstanceIds":<CVM_IDS> chains work.
* Action denylist          knownHallucinatedActions catches common
                           LLM-invented Tencent action names with
                           "did you mean..." hints before the round-
                           trip (GetProductMetricData -> GetMonitorData,
                           cvm.ListInstances -> DescribeInstances, etc).

Credentials

Reads in this order: viper tencent.{secret_id,secret_key,region},
then TENCENTCLOUD_SECRET_* (official Tencent SDK env names),
then TENCENT_SECRET_* short aliases. Default region ap-singapore.

Dependencies

* tencentcloud/tencentcloud-sdk-go/{cvm,vpc,cbs,clb,cdb,postgres,
  redis,mongodb,tke,tag,cam,monitor,cls,billing,lighthouse,...}

The provider is fully read-by-default; write operations
(Create/Modify/Run) go through Maker's existing plan+apply gate, with
destructive Terminate/Delete/Reset/Release/Discontinue actions
additionally requiring the existing --destroyer flag.
Tencent's two monitor APIs disagree about the canonical dimension casing
for namespace QCE/LIGHTHOUSE:

  - DescribeBaseMetrics (metadata)  reports "instanceid"  (lowercase)
  - GetMonitorData      (data)      accepts "InstanceId"  (PascalCase)

We trusted DescribeBaseMetrics when building lighthouse.go, which is why
every Lighthouse metric call has been returning Tencent's misleading
"[InvalidParameterValue] : unauthorized operation or the instance has
been destroyed" — same error code Tencent reuses for genuine permission
gaps and lifecycle issues, which sent the debug down two days of false
leads (CAM, agent install, account type, sub-user vs root, ...).

The user's CAM is genuinely AdministratorAccess. The Cloud Monitor
agent is installed. The Tencent Console displays the metrics fine.
The data was always there — Tencent's data API just rejected our
spelling of the dimension name with a wildly inappropriate error code.

PascalCase is the same form CVM uses, so the fix is a one-character
change to the lighthouseDimensionKey constant. Added a comment block
explaining the discrepancy so the next person who reads this code
doesn't repeat the investigation.

Verified live against lhins-fprj6w5h (ap-singapore):
  CpuUsage          0.80%   (avg 0.96%, 59 samples)
  MemUsage         38.79%   (avg 38.72%, 59 samples)
  DiskUsage        22.18%   (60 samples)
  LighthouseOutpkg     1    (avg 1.22, 59 samples)
DescribeBillSummaryByProduct (the per-product cost call) returns
RealCost but no tax field, so Clanker cost totals never matched the
Tencent console tax-inclusive headline. RealCost is total consumption
(voucher + cash + tax); the console headline is cash out of pocket.

Adds billFeeSummary() — calls DescribeCostExplorerSummary with
Dimensions=feeType, FeeType=cost, the only billing API that breaks out
tax. BillByProductJSON now embeds a summary object:

  consumption      total RealCost (voucher + cash + tax)
  voucher          amount covered by vouchers
  cash_before_tax  cash portion, pre-tax
  tax              tax amount
  cash_incl_tax    cash_before_tax + tax (matches console headline)

The Detail item names are localized display strings; the cash line is
"Total Amount After Discount (Excluding Tax)" which contains the word
"tax", so the substring match checks "discount" before "tax" to avoid
misclassifying the cash line as tax.

Verified against a real April 2026 bill:
  consumption    11,146.37
  voucher         5,701.68
  cash_before_tax 4,905.13
  tax               539.56
  cash_incl_tax   5,444.70  = console "Total Cost (Incl Tax)"
- DescribeVoucherInfo / DescribeVoucherUsageDetails: voucher inventory,
  balances, per-voucher usage history, and a per-owner-UIN breakdown of
  voucher spend (nominal - balance). Voucher APIs only answer on the
  account's home region, so they use a region-aware billing client.
- VoucherByOwnerJSON: month-scoped voucher deduction grouped by the
  owner account UIN of each billed resource, via DescribeBillResourceSummary
  (the voucher APIs carry no per-record UIN).
- CLI: `clanker tencent cost vouchers` / `cost voucher-usage`.
- HTTP API: /cost/vouchers, /cost/voucher-usage/{id}, /cost/voucher-by-owner.
Extend the slim JSON shape returned by /api/v1/tencent/resources/* for
every subscription-capable resource: CVM, Lighthouse, CBS, MySQL,
Postgres, Redis, MongoDB, CynosDB, CLB, AntiDDoS. PREPAID entries now
carry the renewal deadline so callers can see what is about to expire
without a separate billing call.

Tencent uses two billing-mode conventions (string vs int) and the int
form has NO consistent mapping across services — CDB inverts vs the
others. internal/tencent/charge_mode.go centralizes the normalization
into 'PREPAID' / 'POSTPAID' strings so the JSON shape is uniform.
The maker plan executor (SendRaw) gated calls behind a hand-maintained
service map in raw.go, which silently lagged behind the SDK — calling
lighthouse.DescribeInstances failed with 'unsupported tencent service'
even though every other code path (typed clients, dashboard) handled
it fine. Eight other services (antiddos, billing, cdn, cloudaudit,
cynosdb, dc, ssl, teo, waf) were also missing from the map.

Replace the static map with a go:generate-driven one. gen_services.go
walks GOMODCACHE/.../tencentcloud-sdk-go/tencentcloud/*/v* and emits
service_versions_gen.go with one entry per service (latest version
when multiple are vendored). The cos sentinel is preserved via a
manualOverrides map in the generator. Upgrading the SDK now just needs
'go generate ./internal/tencent/...'.

The error message in SendRaw now enumerates services from the generated
map so it stays accurate.
Add auto_renew (*bool, omitempty) to the slim JSON returned by
/api/v1/tencent/resources/* for every prepaid-capable resource where
the SDK exposes the renew flag: CVM, Lighthouse, CBS, MySQL, Postgres,
Redis, CynosDB, and CLB. Consumers can now check whether an expiring
resource will auto-renew or needs manual action — without the field,
expiring auto-renewers would generate false-positive alerts.

Skipped MongoDB and AntiDDoS — their list endpoints don't expose the
renew flag (would require a separate DescribeAutoRenew-style call).

Tencent's renew encoding has three flavors handled by new normalizers
in charge_mode.go: string 'NOTIFY_AND_AUTO_RENEW' (CVM/CBS/Lighthouse)
and 'AUTO_RENEW' (CLB nested in PrepaidAttributes), int64 1 (CDB/Redis/
CynosDB), uint64 1 (Postgres). Using *bool + omitempty so consumers can
distinguish 'not on auto-renew' from 'no info available'.
Three critical findings from rafeegnash's review.

1. Constant-time token comparison (api/middleware.go)
   The previous '!=' compare leaked information through response timing,
   letting an attacker recover the bearer token byte-by-byte by measuring
   latency across many requests. Switch to crypto/subtle.ConstantTimeCompare.

2. Refuse to start without a token (api/server.go, cmd/server.go)
   An empty --token previously disabled auth on every route, including
   POST /api/v1/maker/apply which can mutate real cloud resources. The
   server now aborts startup unless --insecure is explicit (or
   CLANKER_API_TOKEN is set). Help text now shows the token-gated
   invocation as the default example.

3. Bound the filter 'matches' operator (maker/exec_tencent_filter.go)
   regexp.Compile took an unbounded user-controlled pattern via the
   maker plan. Go's RE2 is linear so a true ReDoS is not realistic, but
   we now cap pattern length at 256 chars and run MatchString under a
   100ms wall-clock deadline as defense-in-depth — the filter value
   originates from LLM output and reaches us through the HTTP API, so
   tighter bounds keep that surface predictable.
@rephapeng rephapeng force-pushed the pr/tencent-provider branch from 0e6182e to 1bbb914 Compare May 24, 2026 14:17
rephapeng added a commit to rephapeng/clanker that referenced this pull request May 24, 2026
Three critical findings from rafeegnash's review.

1. Constant-time token comparison (api/middleware.go)
   The previous '!=' compare leaked information through response timing,
   letting an attacker recover the bearer token byte-by-byte by measuring
   latency across many requests. Switch to crypto/subtle.ConstantTimeCompare.

2. Refuse to start without a token (api/server.go, cmd/server.go)
   An empty --token previously disabled auth on every route, including
   POST /api/v1/maker/apply which can mutate real cloud resources. The
   server now aborts startup unless --insecure is explicit (or
   CLANKER_API_TOKEN is set). Help text now shows the token-gated
   invocation as the default example.

3. Bound the filter 'matches' operator (maker/exec_tencent_filter.go)
   regexp.Compile took an unbounded user-controlled pattern via the
   maker plan. Go's RE2 is linear so a true ReDoS is not realistic, but
   we now cap pattern length at 256 chars and run MatchString under a
   100ms wall-clock deadline as defense-in-depth — the filter value
   originates from LLM output and reaches us through the HTTP API, so
   tighter bounds keep that surface predictable.
@rephapeng
Copy link
Copy Markdown
Contributor Author

Thanks again @rafeegnash for the thorough review. First pass landed — 1bbb914.

Rebase

Branch is now rebased onto current upstream/master (d4d94f9). The false-deletion diff from the stale 3e65a24 base is gone. Conflicts didn't materialize in the end — cmd/root.go / cmd/ask.go reconciled cleanly with the new k8s subcommands.

Three critical security findings — all fixed

  • internal/api/middleware.go:43 — now crypto/subtle.ConstantTimeCompare with a length check first. Comment explains why leaking length is acceptable here (token length is server-startup config, never user-controlled).
  • internal/api/server.go:67 — server now refuses to start without a token. Added --insecure flag for the trusted-network/localhost case; help text now leads with the token-gated example instead of the open one.
  • internal/maker/exec_tencent_filter.go:182 — pattern length capped at 256, MatchString runs under a 100 ms wall-clock deadline via goroutine+select. Inline comment notes Go's RE2 makes a true ReDoS unrealistic but bounds the surface for any future engine swap.

Next

Working through the High / Should-fix items now (region validation, CORS default, destructive-action allowlist, ctx propagation, pagination, filter tests). Will push those as a separate commit so re-review can split.

On the PR split — let me get the security pass done first, then I'll come back to you on whether to split. Some of the High-severity items overlap maker + API so the boundary isn't as clean as it looked initially.

rephapeng added a commit to rephapeng/clanker that referenced this pull request May 24, 2026
Three high-severity items from rafeegnash's review.

1. Validate ?region= before it reaches the Tencent SDK (api/routes.go)
   Previously any string was passed verbatim into the client's region —
   enabling enumeration of arbitrary Tencent endpoints and potentially
   driving unintended API charges. Added a regex covering all current
   region prefixes (ap|na|eu|sa|cn)-name(-suffix)?, with a typed
   *errInvalidRegion so the handler chokepoint surfaces 400 instead of
   the catch-all 401. 23 handler call sites switched to a small
   writeTencentClientErr helper.

2. CORS default no longer wildcard (api/server.go, cmd/server.go)
   Default Access-Control-Allow-Origin was "*", letting any page read
   API responses. Bearer auth via header mitigates CSRF but a hostile
   origin could still siphon data when a user pastes their token there.
   Default is now http://localhost:4173 (the bundled dashboard); pass
   --cors-origin explicitly for non-localhost deployments.

3. Switch destructive check to read-only allowlist (maker/exec_tencent.go)
   isTencentDestructive previously prefix-matched only Terminate|Delete|
   Destroy|Reset|Release|Discontinue. CAM mutations like AddUser,
   CreateAccessKey, AttachUserPolicy slipped through without --destroyer.
   Flipped to an allowlist of read-only verb prefixes (Describe, Get,
   List, Query, Lookup, Search, Check, Inquiry). Anything else now
   requires --destroyer — fail-safe by default.

   Behavior change to be aware of: verbs that don't match a read prefix
   (Create*, Run*, Add*, Modify*, Set*, Enable*, Bind*, Associate*,
   Allocate*, etc.) now require --destroyer. ResetInstancesPassword was
   previously whitelisted as 'only changes the password' — that
   whitelist is gone because a password reset locks out anyone using
   the previous credential, which is a security-affecting mutation.
Three high-severity items from rafeegnash's review.

1. Validate ?region= before it reaches the Tencent SDK (api/routes.go)
   Previously any string was passed verbatim into the client's region —
   enabling enumeration of arbitrary Tencent endpoints and potentially
   driving unintended API charges. Added a regex covering all current
   region prefixes (ap|na|eu|sa|cn)-name(-suffix)?, with a typed
   *errInvalidRegion so the handler chokepoint surfaces 400 instead of
   the catch-all 401. 23 handler call sites switched to a small
   writeTencentClientErr helper.

2. CORS default no longer wildcard (api/server.go, cmd/server.go)
   Default Access-Control-Allow-Origin was "*", letting any page read
   API responses. Bearer auth via header mitigates CSRF but a hostile
   origin could still siphon data when a user pastes their token there.
   Default is now http://localhost:4173 (the bundled dashboard); pass
   --cors-origin explicitly for non-localhost deployments.

3. Switch destructive check to read-only allowlist (maker/exec_tencent.go)
   isTencentDestructive previously prefix-matched only Terminate|Delete|
   Destroy|Reset|Release|Discontinue. CAM mutations like AddUser,
   CreateAccessKey, AttachUserPolicy slipped through without --destroyer.
   Flipped to an allowlist of read-only verb prefixes (Describe, Get,
   List, Query, Lookup, Search, Check, Inquiry). Anything else now
   requires --destroyer — fail-safe by default.

   Behavior change to be aware of: verbs that don't match a read prefix
   (Create*, Run*, Add*, Modify*, Set*, Enable*, Bind*, Associate*,
   Allocate*, etc.) now require --destroyer. ResetInstancesPassword was
   previously whitelisted as 'only changes the password' — that
   whitelist is gone because a password reset locks out anyone using
   the previous credential, which is a security-affecting mutation.
@rephapeng
Copy link
Copy Markdown
Contributor Author

High-severity tier landed in 2a18f7b on top of the security commit.

?region= validation (internal/api/routes.go)

  • Regex ^(ap|na|eu|sa|cn)-[a-z]+(-[a-z]+)?$ — covers all current region prefixes including 3-segment ones like ap-shanghai-fsi. Region NAMES never contain digits (those are zone-level), so numeric segments are correctly rejected.
  • New typed *errInvalidRegion; chokepoint helper writeTencentClientErr returns 400 + invalid_region for validation failures vs the existing 401 + tencent_credentials for credential failures. 23 handler call sites switched over.
  • Verified: SQLi-style strings, unknown prefixes, and zone-with-digits all return 400 with a clear message; valid regions (ap-jakarta, ap-singapore, ap-shanghai-fsi) return 200.

CORS default (internal/api/server.go, cmd/server.go)

  • Default Access-Control-Allow-Origin is now http://localhost:4173 (the bundled dashboard). --cors-origin flag remains for explicit override; help text updated to discourage *.

Destructive denylist → read-only allowlist (internal/maker/exec_tencent.go)

  • Allowlisted prefixes: Describe, Get, List, Query, Lookup, Search, Check, Inquiry. Anything else now requires --destroyer.
  • Verified end-to-end through clanker_tencent_maker_apply: all three CAM cases you flagged (AddUser, CreateAccessKey, AttachUserPolicy) are now blocked without --destroyer. Reads (DescribeInstances, GetMonitorData, ListUsers, SearchLog) still pass. Previously-allowed mutations (RunInstances, ModifyVpcAttribute, AssociateAddress) now also need --destroyer.
  • Removed the ResetInstancesPassword whitelist — locking out the previous credential is a security-affecting mutation, so it now needs --destroyer too.

Next batch (Should-fix tier) will be: ctx propagation, pagination beyond 100, and tests for validateFilterCommand / filterMatch.

rephapeng added a commit to rephapeng/clanker that referenced this pull request May 25, 2026
Three items from rafeegnash's review.

1. ctx propagation + per-request timeout (tencent/profile.go, all clients)
   The Tencent SDK has no WithContext variants — caller ctx cancellation
   cannot interrupt a request in flight. As a defense, every typed client
   now flows through newClientProfile(endpoint), which sets
   HttpProfile.ReqTimeout = 30s. Combined with a ctxDone() check between
   pagination pages and between GetRelevantContext sections, Ctrl-C now
   bounds the wall-clock cost of cancellation to the single in-flight
   SDK call (was: indefinite).

   All 22 profile.NewClientProfile() + Endpoint sites were converted to
   the helper; the now-unused profile import was stripped from those files.

2. Paginate GetRelevantContext past 100 (tencent/context.go)
   contextCVMs, contextVPCs, contextSecurityGroups now loop through pages
   with offset/limit until TotalCount is exhausted, capped at
   gatherMaxItems (1000) with a logGatherTruncated() warning when the cap
   fires. These are the highest-cardinality types — production accounts
   commonly cross 100 here. Other gather functions still single-call at
   limit=100; bringing them up is mechanical follow-up.

3. Tests for filter validator and matcher (maker/exec_tencent_filter_test.go)
   Table-driven coverage for validateFilterCommand (arg count, sourceIdx,
   op enum) and filterMatch (every operator + every JSON value type).
   Includes the ReDoS-defense cases added in the critical-tier commit:
   oversize pattern returns false, malformed regex returns false, PCRE-
   style catastrophic-backtrack patterns don't hang.
Three items from rafeegnash's review.

1. ctx propagation + per-request timeout (tencent/profile.go, all clients)
   The Tencent SDK has no WithContext variants — caller ctx cancellation
   cannot interrupt a request in flight. As a defense, every typed client
   now flows through newClientProfile(endpoint), which sets
   HttpProfile.ReqTimeout = 30s. Combined with a ctxDone() check between
   pagination pages and between GetRelevantContext sections, Ctrl-C now
   bounds the wall-clock cost of cancellation to the single in-flight
   SDK call (was: indefinite).

   All 22 profile.NewClientProfile() + Endpoint sites were converted to
   the helper; the now-unused profile import was stripped from those files.

2. Paginate GetRelevantContext past 100 (tencent/context.go)
   contextCVMs, contextVPCs, contextSecurityGroups now loop through pages
   with offset/limit until TotalCount is exhausted, capped at
   gatherMaxItems (1000) with a logGatherTruncated() warning when the cap
   fires. These are the highest-cardinality types — production accounts
   commonly cross 100 here. Other gather functions still single-call at
   limit=100; bringing them up is mechanical follow-up.

3. Tests for filter validator and matcher (maker/exec_tencent_filter_test.go)
   Table-driven coverage for validateFilterCommand (arg count, sourceIdx,
   op enum) and filterMatch (every operator + every JSON value type).
   Includes the ReDoS-defense cases added in the critical-tier commit:
   oversize pattern returns false, malformed regex returns false, PCRE-
   style catastrophic-backtrack patterns don't hang.
@rephapeng
Copy link
Copy Markdown
Contributor Author

Should-fix tier landed in e5530b3.

ctx + per-request timeout (internal/tencent/profile.go, all clients)
The SDK exposes no WithContext variants so caller ctx cancellation cannot interrupt a request in flight — added a newClientProfile(endpoint) helper that sets HttpProfile.ReqTimeout = 30s and refactored all 22 profile-constructor sites to use it. Also added a ctxDone(ctx) check between pagination pages and between GetRelevantContext sections so Ctrl-C bounds wall-clock cost to one in-flight SDK call instead of indefinite.

Pagination (internal/tencent/context.go)
contextCVMs, contextVPCs, contextSecurityGroups now loop with offset/limit until TotalCount is reached, capped at gatherMaxItems = 1000 with logGatherTruncated() when the cap fires. These were the highest-cardinality types — production accounts cross 100 here most often. The remaining gather functions still single-call at limit=100; bringing them up is mechanical follow-up and I held back to keep this commit reviewable.

Filter tests (internal/maker/exec_tencent_filter_test.go)
Table-driven coverage for validateFilterCommand (arg count, sourceIdx, op enum — both happy paths and each rejection reason) and filterMatch (every operator × representative JSON value types). Also locks in the ReDoS-defense behavior added in the critical-tier commit: oversize patterns return false, uncompilable regexes return false, PCRE-style catastrophic-backtrack patterns don't hang. go test ./internal/maker/ passes.

Remaining from the review:

  • The nits (credential redaction, doubled gather call in routes.go:104-128, NewClientWithBackendCredentials factory, paramsJSON per-field size cap)
  • Optional: paginate the other gather functions

I can roll the nits into a follow-up if you want — they're all tiny and could land together.

rephapeng added a commit to rephapeng/clanker that referenced this pull request May 25, 2026
Four small items from the review.

1. Credentials redaction (tencent/client.go)
   Added String() and MarshalJSON() on Credentials so %v / %+v / Println
   and json.Marshal all render SecretKey as **** instead of leaking the
   raw key. Direct field access (the SDK signature path) is unchanged.

2. Drop doubled gather call (api/routes.go)
   handleTencentResources called GetRelevantContext (full multi-section
   gather) and discarded the result, then called gatherTencentByType for
   the requested type — doubling SDK calls per request. Removed the
   GetRelevantContext call.

3. NewClientWithCredentials factory for parity (tencent/client.go)
   Added BackendTencentCredentials struct and NewClientWithCredentials
   constructor matching the shape AWS / GCP / Fly.io / etc. already use.
   Not wired into the backend credential flow yet — kept for consistency
   so the dispatch layer can treat Tencent the same as every other
   provider.

4. Per-field size cap on paramsJSON (tencent/raw.go)
   maxParamsJSONBytes (256 KiB total) + maxParamsFieldBytes (64 KiB per
   string field, walked recursively into nested maps and slices). The
   effective cap was the 1 MiB HTTP body limit, which is far larger than
   any legitimate Tencent action payload — 64 KiB still fits user-data
   scripts and policy documents while rejecting accidentally-pasted
   dumps from an LLM plan.
Four small items from the review.

1. Credentials redaction (tencent/client.go)
   Added String() and MarshalJSON() on Credentials so %v / %+v / Println
   and json.Marshal all render SecretKey as **** instead of leaking the
   raw key. Direct field access (the SDK signature path) is unchanged.

2. Drop doubled gather call (api/routes.go)
   handleTencentResources called GetRelevantContext (full multi-section
   gather) and discarded the result, then called gatherTencentByType for
   the requested type — doubling SDK calls per request. Removed the
   GetRelevantContext call.

3. NewClientWithCredentials factory for parity (tencent/client.go)
   Added BackendTencentCredentials struct and NewClientWithCredentials
   constructor matching the shape AWS / GCP / Fly.io / etc. already use.
   Not wired into the backend credential flow yet — kept for consistency
   so the dispatch layer can treat Tencent the same as every other
   provider.

4. Per-field size cap on paramsJSON (tencent/raw.go)
   maxParamsJSONBytes (256 KiB total) + maxParamsFieldBytes (64 KiB per
   string field, walked recursively into nested maps and slices). The
   effective cap was the 1 MiB HTTP body limit, which is far larger than
   any legitimate Tencent action payload — 64 KiB still fits user-data
   scripts and policy documents while rejecting accidentally-pasted
   dumps from an LLM plan.
@rephapeng
Copy link
Copy Markdown
Contributor Author

Nits tier landed in 7a71923. With this, every item from your review is addressed.

  1. Credentials redaction (internal/tencent/client.go) — added String() + MarshalJSON() so %v/%+v/Println/json.Marshal all render SecretKey as ****. Direct field access (the SDK signature path) is unchanged so functionality is identical.
  2. Doubled gather call (internal/api/routes.go) — handleTencentResources no longer calls GetRelevantContext before gatherTencentByType. Each request now hits the SDK once instead of twice. Verified end-to-end (lighthouse/cvm/cbs lists still return data).
  3. NewClientWithCredentials factory (internal/tencent/client.go) — added BackendTencentCredentials struct + constructor matching the shape AWS / GCP / Fly.io already use, so the backend dispatch layer can treat Tencent uniformly once Tencent is wired into the backend credential flow.
  4. paramsJSON size cap (internal/tencent/raw.go) — maxParamsJSONBytes = 256 KiB total, maxParamsFieldBytes = 64 KiB per string field (walked recursively into nested maps and slices). Effective ceiling was the 1 MiB HTTP body limit, which is well over any legitimate Tencent action payload.

Status of the full review:

Tier Status
Critical (3) Shipped in 1bbb914
High (3) Shipped in 2a18f7b
Should-fix (3) Shipped in e5530b3
Nits (4) Shipped in 7a71923
Rebase onto current master Done — branch base is d4d94f9

The only thing left from your notes is paginating the remaining low-cardinality gather functions (CAM, NAT, VPN, CCN, DC, etc.) — those typically have <10 items per account so they don't hit the 100-row truncation in practice, but I'm happy to do them if you'd prefer everything paginated for uniformity.

Ready for another pass whenever you have a moment. Thanks again for the thorough first review.

Cron-facing alert that walks every PREPAID-capable resource (CVM,
Lighthouse, CBS, MySQL, Postgres, Redis, MongoDB, CynosDB, CLB,
AntiDDoS, and SSL with --include-ssl) across the requested regions and
returns items at or below a renewal threshold.

  • CLI:  clanker tencent expiry --regions=ap-x,ap-y --threshold=14
          Exit 0 = nothing flagged, 1 = items in window, 2 = already
          expired. Drop-in for crontab + MAILTO, GitHub Actions, etc.

  • HTTP: GET /api/v1/tencent/expiry?regions=&threshold=&manual_only=&include_ssl=
          Returns the full report with counts breakdown (total / flagged
          / expired / auto_renew). Region query param is validated through
          the same regex SendRaw uses, so SSRF-shaped inputs are rejected.

manual_only defaults to true so the cron only surfaces items that won't
auto-renew (the actionable subset); auto-renewers are still counted in
counts.auto_renew for visibility.
Copy link
Copy Markdown
Collaborator

@rafeegnash rafeegnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just walked the new revision end-to-end — really nice work, this looks great overall. Every item from the first review is properly addressed:

  • Rebase is clean (base is 2728110, no K8s deletions).
  • Constant-time token compare — verified.
  • Server refuses to start without a token (or explicit --insecure) — built the binary and confirmed empirically.
  • ReDoS defense on filter matches — pattern cap + goroutine-with-timeout on top of Go's RE2.
  • Region regex chokepoint with the typed *errInvalidRegion.
  • CORS default narrowed to http://localhost:4173.
  • Destructive allowlist (Describe|Get|List|Query|Lookup|Search|Check|Inquiry) — AddUser / CreateAccessKey / AttachUserPolicy correctly blocked without --destroyer.
  • HttpProfile.ReqTimeout = 30s + ctxDone between pages.
  • Pagination on CVM/VPC/SG up to 1000 with truncation warning.
  • Filter tests covering happy + sad paths + ReDoS defense.
  • Credential String() + MarshalJSON() redaction.
  • Doubled GetRelevantContext call removed.
  • NewClientWithBackendCredentials factory.
  • paramsJSON size caps.

Also reviewed the new /api/v1/tencent/expiry endpoint from 1a23646 — it validates regions / threshold / booleans cleanly. 👍

One blocker I caught while running the built binary: a single-request crash in writeTencentClientErr from the security commit (see inline comment). Genuinely a tiny typo-class bug — one line change to fix — but since the credential-missing error path is the most common non-region failure, any prod restart without env vars set hits it. Worth a quick httptest covering "no creds → 401" to lock it in.

A few smaller items I noticed while looking, none are blockers:

  • internal/api/routes_maker.go:147countDestructiveCommands still uses the old prefix denylist (Terminate|Delete|Destroy|Reset|Release|Discontinue) with the ResetInstancesPassword carve-out. Now that the executor is allowlist-based, the audit record undercounts: an AddUser blocked by --destroyer records as 0 destructive commands. Easy to fix by calling into isTencentDestructive from exec_tencent.go.
  • internal/api/middleware.go — failed bearer attempts return 401 but never reach logMiddleware (it's nested inside auth), so 401s are silent. A one-line log in the 401 branch would help for prod observability.
  • internal/api/server.go:28 comment still says "*" by default — should reflect the new http://localhost:4173 default.
  • internal/maker/exec_tencent.go:124-126 docstring still describes the old prefix denylist; should be updated to the allowlist.
  • routes_plan.go returns 200 OK + a warning field when the LLM emits unparseable JSON — would be cleaner as a 4xx so clients don't have to inspect the body to know it failed.

If you fix the recursion + roll a quick test for it, the rest of the small stuff can either come along in the same commit or land as a tiny follow-up — totally up to you. Thanks again for the careful work on this round, the diff reads really well now.

Comment thread internal/api/routes.go Outdated
writeError(w, http.StatusBadRequest, "invalid_region", err.Error())
return
}
writeTencentClientErr(w, err)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny but critical bug from the security-tier cleanup — this branch recurses with the same args instead of returning the credential-failure response. Any non-*errInvalidRegion error (specifically the one tencent.NewClient returns when credentials are missing) hits this line and the function calls itself until the goroutine stack runs out.

Reproduced by building the PR + running clanker server --port 18080 --token x with no Tencent creds set, then curl -H 'Authorization: Bearer x' http://127.0.0.1:18080/api/v1/tencent/regions — the process dies with:

fatal error: stack overflow
github.com/bgdnvk/clanker/internal/api.writeTencentClientErr(...)
github.com/bgdnvk/clanker/internal/api.writeTencentClientErr(...)
github.com/bgdnvk/clanker/internal/api.writeTencentClientErr(...)
...

Single authenticated request → server gone. Worth catching because credential rotation in prod would hit this without anyone touching the server.

One-line fix matching the intent documented in the PR description:

func writeTencentClientErr(w http.ResponseWriter, err error) {
    var ir *errInvalidRegion
    if errors.As(err, &ir) {
        writeError(w, http.StatusBadRequest, "invalid_region", err.Error())
        return
    }
    writeError(w, http.StatusUnauthorized, "tencent_credentials", err.Error())
}

Would also be a good place to add a tiny httptest against any one Tencent handler with no creds set — that test would have caught this in CI.

rephapeng added a commit to rephapeng/clanker that referenced this pull request May 25, 2026
Six items from rafeegnash's second-pass review. The first is a
real production-killer; the rest are correctness + hygiene.

1. writeTencentClientErr no longer infinite-recurses (api/routes.go)
   The catch-all branch from the security commit called itself instead
   of writeError — stack overflow on the very first credential-missing
   request, taking the process with it. New regression test in
   api/routes_test.go (httptest + direct helper invocation, including
   the wrapped-error errors.As path) locks the fix in.

2. countDestructiveCommands uses the live classifier (api/routes_maker.go)
   The plan audit count was still computed from the old prefix denylist,
   so AddUser / CreateAccessKey / AttachUserPolicy gated by --destroyer
   recorded as 0 destructive commands. Exported IsTencentDestructive from
   the maker package; routes_maker now delegates to it so the audit
   count never drifts away from what the executor's safety gate enforces.

3. 401s no longer silent (api/middleware.go)
   authMiddleware wraps logMiddleware so a rejected request short-circuits
   before any access log fires. Added s.log401() that records method +
   path + remote_addr + reason on every failed bearer attempt, so prod
   credential rotations or auth attacks show up in stderr.

4. routes_plan returns 422 on unparseable LLM output (api/routes_plan.go)
   Was 200 + 'warning' field, which forced clients to inspect the body
   to know it failed. 422 Unprocessable Entity is the right semantic —
   we understood the request but the upstream result was unprocessable.
   The raw cleaned text is still in data.plan so the dashboard editor
   can offer hand-correction.

5. Comments brought in line with code (api/server.go, maker/exec_tencent.go)
   Config.CORSOrigin doc said '"*" by default'; now says http://localhost:4173.
   validateTencentCommand docstring described the old prefix denylist;
   now describes the allowlist.
Six items from rafeegnash's second-pass review. The first is a
real production-killer; the rest are correctness + hygiene.

1. writeTencentClientErr no longer infinite-recurses (api/routes.go)
   The catch-all branch from the security commit called itself instead
   of writeError — stack overflow on the very first credential-missing
   request, taking the process with it. New regression test in
   api/routes_test.go (httptest + direct helper invocation, including
   the wrapped-error errors.As path) locks the fix in.

2. countDestructiveCommands uses the live classifier (api/routes_maker.go)
   The plan audit count was still computed from the old prefix denylist,
   so AddUser / CreateAccessKey / AttachUserPolicy gated by --destroyer
   recorded as 0 destructive commands. Exported IsTencentDestructive from
   the maker package; routes_maker now delegates to it so the audit
   count never drifts away from what the executor's safety gate enforces.

3. 401s no longer silent (api/middleware.go)
   authMiddleware wraps logMiddleware so a rejected request short-circuits
   before any access log fires. Added s.log401() that records method +
   path + remote_addr + reason on every failed bearer attempt, so prod
   credential rotations or auth attacks show up in stderr.

4. routes_plan returns 422 on unparseable LLM output (api/routes_plan.go)
   Was 200 + 'warning' field, which forced clients to inspect the body
   to know it failed. 422 Unprocessable Entity is the right semantic —
   we understood the request but the upstream result was unprocessable.
   The raw cleaned text is still in data.plan so the dashboard editor
   can offer hand-correction.

5. Comments brought in line with code (api/server.go, maker/exec_tencent.go)
   Config.CORSOrigin doc said '"*" by default'; now says http://localhost:4173.
   validateTencentCommand docstring described the old prefix denylist;
   now describes the allowlist.
@rephapeng
Copy link
Copy Markdown
Contributor Author

Round-2 fixes landed in 6e2373b. Thanks for the careful walk-through and especially for catching the recursion — that one would've been bad on a credential rotation.

Recursion fix + regression test (internal/api/routes.go, internal/api/routes_test.go)
The catch-all branch now writes the 401 envelope instead of calling itself. New routes_test.go adds three table cases via httptest: (1) no credentials → 401 with tencent_credentials (the case that previously stack-overflowed), (2) ?region=evil%20region → 400 with invalid_region, (3) direct writeTencentClientErr calls including the wrapped-with-%w path through errors.As. go test ./internal/api/ passes.

countDestructiveCommands delegates to the live classifier (internal/api/routes_maker.go)
Exported IsTencentDestructive from maker/exec_tencent.go; routes_maker now calls it instead of carrying a duplicate denylist. The audit-record count tracks the executor's safety gate by construction — AddUser / CreateAccessKey / AttachUserPolicy are now counted as destructive in the history record same as they're blocked in execution.

401 logging (internal/api/middleware.go)
Added s.log401(r, reason) that writes [api] 401 METHOD URI from REMOTE — REASON on every rejected attempt. logMiddleware sits inside authMiddleware so 401s never reach the normal access log — verified with a wrong-token curl that the new log line shows up in stderr.

routes_plan returns 422 on unparseable LLM output (internal/api/routes_plan.go)
Now 422 Unprocessable Entity with an error.code = unparseable_plan envelope. The raw cleaned text and original parse error are still in data.plan so the dashboard editor can offer hand-correction; just the status code changed.

Stale comments updated

  • Config.CORSOrigin doc: "*" by defaultdefaults to http://localhost:4173.
  • validateTencentCommand docstring: prefix-denylist text → describes the read-only allowlist (Describe / Get / List / Query / Lookup / Search / Check / Inquiry).

Let me know if you'd like the destructive-classifier moved to a shared location (it currently lives in maker and gets imported by api) or anything else from this batch tweaked.

@rafeegnash rafeegnash merged commit 2aa4d34 into bgdnvk:master May 25, 2026
@rafeegnash
Copy link
Copy Markdown
Collaborator

Merged 🎉 Thank you for the careful work on this — really appreciate how thoroughly you walked through every item across the three review rounds. Recursion fix verified empirically against the built binary (same trigger that previously stack-overflowed now returns a clean 401 + JSON envelope, server stays up, 401 appears in the logs with method / URI / remote / reason). All five other items also addressed cleanly, and the new internal/api/routes_test.go will keep this from regressing in CI.

Tencent provider is now in master alongside the other clouds. Welcome aboard, and thanks again — this is great work to start from. 🙌

rephapeng added a commit to rephapeng/clanker that referenced this pull request May 28, 2026
Upstream merged PR bgdnvk#165 (Tencent provider) and added work on top:
k8s SRE playbooks (bgdnvk#174), SRE agent fix (bgdnvk#177), tree-wide gofmt -s
(bgdnvk#176), README (bgdnvk#175), and three Tencent CLI features the fork lacked
— `list --format json` (bgdnvk#179), `cost --format json` (bgdnvk#180), and
security-scan CLI subcommands (bgdnvk#181).

Conflict resolution: all 16 conflicts resolved to upstream's side.
14 were pure gofmt whitespace from bgdnvk#176 (identical code); billing.go
and static_commands.go were upstream supersets adding the JSON/security
CLI surface with no fork-unique code lost. Fixed a duplicate tencent
import in cmd/ask.go left by the auto-merge.

Verified in Docker (golang:1.25, -mod=mod): gofmt clean, go build ./...,
go vet ./..., and go test ./... all pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants