Cedarling Telemetry RFC

Oleh edited this page Apr 2, 2026 · 1 revision

Cedarling Telemetry

This document defines the telemetry data that a Cedarling instance can collect and send to the Lock Server via the /audit/telemetry/bulk endpoint. Telemetry is organized into three maps within the TelemetryEntry message: policy_stats, error_counters, and operational_stats.

All counters are reset to zero at the start of each telemetry interval. Gauge-type values (marked with gauge) reflect a point-in-time snapshot taken at flush time.
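The counter and gauge semantics above can be sketched as a small in-memory collector. This is only an illustration, not the Cedarling implementation: the class name `TelemetryCollector`, its methods, and the injectable `clock` are hypothetical, and for brevity everything is placed in `operational_stats`.

```python
import time
from collections import Counter
from typing import Callable, Dict


class TelemetryCollector:
    """Hypothetical collector: counters reset per interval, gauges read at flush."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._window_start = clock()
        self._counters: Counter = Counter()
        self._gauges: Dict[str, Callable[[], int]] = {}

    def incr(self, key: str, amount: int = 1) -> None:
        # Counters accumulate over the current interval only.
        self._counters[key] += amount

    def register_gauge(self, key: str, read: Callable[[], int]) -> None:
        # Gauges are not stored; they are read lazily at flush time.
        self._gauges[key] = read

    def flush(self) -> dict:
        now = self._clock()
        stats = dict(self._counters)
        stats.update({key: read() for key, read in self._gauges.items()})
        entry = {
            # Actual elapsed time, which may differ from the configured interval.
            "interval_secs": int(now - self._window_start),
            "operational_stats": stats,
        }
        # Start the next interval from zero.
        self._counters.clear()
        self._window_start = now
        return entry
```

After a flush, a counter that saw no activity is simply absent from the next entry, while every registered gauge is read again.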

Proto definition

message TelemetryEntry {
  google.protobuf.Timestamp creation_date = 1;
  // field 2 (event_time) removed: a telemetry entry covers a time range, not a single event
  string service = 3;
  string node_name = 4;
  string status = 5;
  // fields 6-12 removed: migrated to operational_stats map for extensibility
  map<string, int64> policy_stats = 13;
  map<string, int64> error_counters = 14;
  map<string, int64> operational_stats = 15;
  int64 interval_secs = 16;  // the collection period this entry covers
}

Field notes

creation_date — Timestamp when the telemetry entry was created and sent. For telemetry this represents the end of the collection window (the start can be derived as creation_date - interval_secs). The previous event_time field has been removed because for telemetry there is no single "event" — the entry covers a time range, not a point in time. The interval_secs field together with creation_date fully defines the collection window.
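As a receiver-side illustration of how the two fields define the window, the following sketch derives the start and end from a decoded JSON entry. The function name `collection_window` is hypothetical, and it assumes `creation_date` arrives as an RFC 3339 string with a `Z` suffix, as in the REST example in this document.

```python
from datetime import datetime, timedelta


def collection_window(entry: dict) -> tuple:
    """Return (start, end) of the window covered by a telemetry entry."""
    # datetime.fromisoformat() does not accept a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    end = datetime.fromisoformat(entry["creation_date"].replace("Z", "+00:00"))
    start = end - timedelta(seconds=entry["interval_secs"])
    return start, end
```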

Note: The existing LogEntry and HealthEntry messages in audit.proto retain both creation_date and event_time for backwards compatibility. For those message types, both fields are currently set to the same value.

interval_secs field

The interval_secs field indicates the duration (in seconds) of the collection window that this TelemetryEntry covers. All counter values in error_counters, operational_stats, and policy_stats represent totals accumulated over this period. The receiver needs this value to compute rates (e.g., requests per second = authz.requests_total / interval_secs).

The value comes from the CEDARLING_LOCK_TELEMETRY_INTERVAL bootstrap property, which controls both the collection period and the send interval. Under normal operation, interval_secs equals the configured interval. It may differ if the instance was just started (first partial interval) or is shutting down (final flush before exit).
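Because the first and last entries of an instance may cover a partial interval, a receiver would likely want to divide by the reported interval_secs rather than by the configured value. A minimal sketch, with `per_second` as a hypothetical helper name and the assumption that a missing counter means zero:

```python
from typing import Optional


def per_second(entry: dict, key: str) -> Optional[float]:
    """Rate of an operational_stats counter over the entry's own window."""
    interval = entry.get("interval_secs", 0)
    if interval <= 0:
        # No usable window (for example a flush immediately after start).
        return None
    # Assumption: an omitted counter is equivalent to zero.
    return entry.get("operational_stats", {}).get(key, 0) / interval
```

With the values from the REST example in this document (1240 requests over 60 seconds), this gives roughly 20.7 requests per second.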

Bootstrap configuration

The existing CEDARLING_LOCK_TELEMETRY_INTERVAL property (in seconds, 0 = disabled) currently controls only the send interval. Once telemetry is implemented, it also defines the collection window, because counters are reset after each send.

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| CEDARLING_LOCK_TELEMETRY_INTERVAL | u64 | 0 (disabled) | How often (in seconds) to collect, reset, and send telemetry to the Lock Server. Value of 0 disables telemetry entirely. Recommended: 60 |

Note: Currently there is no separate property for the collection period vs. the send interval — they are always equal. If a use case arises where finer-grained collection is needed (e.g., collect every 10s for more accurate percentiles but send every 60s), a new property CEDARLING_LOCK_TELEMETRY_COLLECTION_INTERVAL could be introduced. Until then, CEDARLING_LOCK_TELEMETRY_INTERVAL serves both purposes.

REST API example

The telemetry REST endpoint follows the same pattern as the existing log endpoint: a POST with a JSON array body to the /bulk sub-path. The JSON field names match the proto field names using snake_case (same convention as LogEntry / HealthEntry).

Request:

POST /jans-auth/v1/audit/telemetry/bulk HTTP/1.1
Host: lock.example.com
Authorization: Bearer <access_token>
Content-Type: application/json

[
  {
    "creation_date": "2026-04-02T10:05:00Z",
    "service": "my_app",
    "node_name": "019615a1-d7c0-7fc0-8000-deadbeef1234",
    "status": "running",
    "interval_secs": 60,
    "policy_stats": {
      "allow_read_documents": 340,
      "deny_admin_access": 12,
      "allow_user_update": 88
    },
    "error_counters": {
      "jwt.validation_failed": 8,
      "jwt.untrusted_issuer": 2,
      "authz.entity_build": 3,
      "data.storage_limit": 1
    },
    "operational_stats": {
      "authz.requests_total": 1240,
      "authz.requests_unsigned": 200,
      "authz.requests_multi_issuer": 1040,
      "authz.decision_allow": 1218,
      "authz.decision_deny": 8,
      "authz.errors_total": 14,
      "authz.last_eval_time_us": 38,
      "authz.eval_time_p50_us": 35,
      "authz.eval_time_p95_us": 120,
      "authz.eval_time_p99_us": 450,
      "authz.eval_time_max_us": 1200,
      "authz.principals_per_request_avg": 2,
      "token_cache.hits": 890,
      "token_cache.misses": 350,
      "token_cache.size": 128,
      "token_cache.evictions": 12,
      "jwt.validations_total": 1240,
      "jwt.validations_success": 1230,
      "jwt.validations_failed": 10,
      "jwt.tokens_skipped_untrusted": 2,
      "data.entries_count": 45,
      "data.total_size_bytes": 12800,
      "data.push_ops": 30,
      "data.get_ops": 1040,
      "data.remove_ops": 5,
      "data.ttl_expirations": 8,
      "data.memory_alert_triggered": 0,
      "lock.batches_sent": 6,
      "lock.entries_sent": 1240,
      "lock.retries": 0,
      "lock.queue_depth": 0,
      "instance.uptime_secs": 3600,
      "instance.memory_usage_bytes": 52428800,
      "instance.policy_count": 15,
      "instance.trusted_issuers_loaded": 3,
      "instance.trusted_issuers_failed": 0
    }
  }
]

Response (success):

{
  "success": true,
  "message": ""
}

Notes:

  • The endpoint URL is derived from the Lock Server's .well-known/lock-server-configuration response (audit.telemetry_endpoint), with /bulk appended automatically — same as for the log endpoint.
  • The Authorization header uses the same Bearer token obtained during Dynamic Client Registration (scope: https://jans.io/oauth/lock/telemetry.write).
  • The body is always a JSON array (even with a single entry) to match the BulkTelemetryRequest proto pattern.
  • Empty maps may be omitted or sent as {}. Zero-value counters may be omitted to reduce payload size.
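The last two notes can be sketched as a small body builder. The names `compact_entry` and `bulk_body` are hypothetical. Note one simplification that goes beyond the text above: this sketch drops every zero value, including zero-valued gauges, which is only safe if the receiver treats a missing key as zero.

```python
import json

MAP_FIELDS = ("policy_stats", "error_counters", "operational_stats")


def compact_entry(entry: dict) -> dict:
    """Copy an entry, dropping zero values and maps that end up empty."""
    compact = {k: v for k, v in entry.items() if k not in MAP_FIELDS}
    for field in MAP_FIELDS:
        non_zero = {k: v for k, v in entry.get(field, {}).items() if v != 0}
        if non_zero:
            compact[field] = non_zero
    return compact


def bulk_body(entries: list) -> str:
    """Serialize entries as a JSON array, even when there is only one."""
    return json.dumps([compact_entry(e) for e in entries])
```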

Migration from previous proto fields

The following dedicated fields (6-12) have been replaced by keys in the operational_stats map. This makes the protocol extensible — adding a new metric requires only a new map key, not a proto change.

| Old field | Old number | New location | Reason |
| --- | --- | --- | --- |
| last_policy_load_size | 6 | operational_stats["instance.policy_count"] | Gauge metric, fits naturally in the map |
| policy_success_load_counter | 7 | operational_stats["instance.trusted_issuers_loaded"] | Duplicate concept, consolidated |
| policy_failed_load_counter | 8 | operational_stats["instance.trusted_issuers_failed"] | Duplicate concept, consolidated |
| last_policy_evaluation_time_ns | 9 | operational_stats["authz.last_eval_time_us"] | Moved to latency section (units changed to microseconds) |
| avg_policy_evaluation_time_ns | 10 | Removed (replaced by percentiles) | Averages hide outliers. Replaced by authz.eval_time_p50_us, authz.eval_time_p95_us, authz.eval_time_p99_us |
| memory_usage | 11 | operational_stats["instance.memory_usage_bytes"] | Gauge metric, fits naturally in the map |
| evaluation_requests_count | 12 | operational_stats["authz.requests_total"] | Counter metric, consolidated with other authz stats |
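If a consumer still holds entries in the old shape, the table above could be applied mechanically, as in the sketch below. The function name `migrate_legacy_fields` is hypothetical, and the truncating nanosecond-to-microsecond conversion is an assumption, since the RFC does not specify rounding.

```python
RENAMED = {
    "last_policy_load_size": "instance.policy_count",
    "policy_success_load_counter": "instance.trusted_issuers_loaded",
    "policy_failed_load_counter": "instance.trusted_issuers_failed",
    "memory_usage": "instance.memory_usage_bytes",
    "evaluation_requests_count": "authz.requests_total",
}


def migrate_legacy_fields(legacy: dict) -> dict:
    """Map old dedicated fields (6-12) to operational_stats keys."""
    stats = {new: legacy[old] for old, new in RENAMED.items() if old in legacy}
    if "last_policy_evaluation_time_ns" in legacy:
        # ns -> us; truncating integer division is an assumption.
        stats["authz.last_eval_time_us"] = legacy["last_policy_evaluation_time_ns"] // 1000
    # avg_policy_evaluation_time_ns has no equivalent key and is dropped.
    return stats
```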

policy_stats

Per-policy evaluation counts. Each key is a policy ID from the policy store, and the value is the number of times that policy was referenced in a Cedar reason (i.e., contributed to an authorization decision) during the interval.

| Key pattern | Value | Description |
| --- | --- | --- |
| <policy_id> | count | Number of times this policy appeared in the Cedar diagnostics reason set during the interval |

Example:

{
  "allow_read_documents": 340,
  "deny_admin_access": 12,
  "allow_user_update": 88
}
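A map like the one above could be produced by tallying the reason set of each decision. In this sketch the reason sets are plain Python sets of policy ID strings and `count_policy_reasons` is a hypothetical name. How Cedarling aggregates reasons across several principals in one request is not specified here, so each set is treated as one decision.

```python
from collections import Counter
from typing import Iterable, Set


def count_policy_reasons(reason_sets: Iterable[Set[str]]) -> dict:
    """Count how often each policy ID appears across decision reason sets."""
    stats: Counter = Counter()
    for reasons in reason_sets:
        # A set holds each policy ID at most once, so a policy is
        # counted once per decision it contributed to.
        stats.update(reasons)
    return dict(stats)
```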

error_counters

Classification counters for errors encountered during the telemetry interval. Each key identifies a specific error variant and the value is the number of occurrences.

JWT validation errors

Source: ValidateJwtError, JwtProcessingError

| Key | Source | Description |
| --- | --- | --- |
| jwt.decode_failed | ValidateJwtError::DecodeJwt | JWT is malformed: not in header.payload.signature format, base64 decode failure, or header/claims JSON deserialization failure |
| jwt.missing_key | ValidateJwtError::MissingValidationKey | No decoding key available for this JWT's kid. JWKS may not have been fetched or the key was rotated |
| jwt.missing_validator | ValidateJwtError::MissingValidator | No validator initialized for this issuer+algorithm combination. Either the issuer is untrusted or the algorithm is not in signature_algorithms_supported |
| jwt.validation_failed | ValidateJwtError::ValidateJwt | Signature verification or standard claim validation failed (expired, wrong audience, nbf in future, etc.) |
| jwt.missing_claims | ValidateJwtError::MissingClaims | Token is missing required claims defined per token type in the trusted issuer config (e.g. sub, iss, aud) |
| jwt.status_check_failed | ValidateJwtError::GetJwtStatus | Failed to fetch or parse the JWT status list reference |
| jwt.status_rejected | ValidateJwtError::RejectJwtStatus | Token was revoked or suspended according to the IETF status list |
| jwt.missing_status_list | ValidateJwtError::MissingStatusList | Status validation is enabled but no status list is available for this token |
| jwt.untrusted_issuer | ValidateJwtError::TrustedIssuerValidation(UntrustedIssuer) | Token's iss claim does not match any trusted issuer in the policy store |
| jwt.missing_required_claim | ValidateJwtError::TrustedIssuerValidation(MissingRequiredClaim) | Trusted issuer requires a specific claim that the token does not have |
| jwt.signed_authz_unavailable | JwtProcessingError::SignedAuthzUnavailable | Signed authorization requested but no trusted issuers or JWKS were configured |

Authorization errors

Source: AuthorizeError

| Key | Source | Description |
| --- | --- | --- |
| authz.invalid_action | AuthorizeError::Action | The action string could not be parsed as a valid Cedar EntityUid |
| authz.identifier_parsing | AuthorizeError::IdentifierParsing | An entity type name or action identifier failed to parse |
| authz.invalid_context | AuthorizeError::CreateContext | Context JSON does not conform to the Cedar schema for this action |
| authz.invalid_principal | AuthorizeError::InvalidPrincipal | The Cedar request for a principal does not conform to the schema |
| authz.request_validation | AuthorizeError::RequestValidation | Cedar request validation failed (schema mismatch between action, principal, resource) |
| authz.entity_validation | AuthorizeError::ValidateEntities | Built entities violate the Cedar schema constraints |
| authz.entity_build | AuthorizeError::BuildEntity | Failed to construct a Cedar entity: missing entity ID, invalid UID format, attribute evaluation failure, or entity type not in schema |
| authz.context_build | AuthorizeError::BuildContext | Context merging failed: key conflict between request context and pushed data, unknown action in schema, or missing entity reference |
| authz.rule_execution | AuthorizeError::ExecuteRule | Failed to apply principal_bool_operator aggregation rule across principals |
| authz.unsigned_role_build | AuthorizeError::BuildUnsignedRoleEntity | Role field in unsigned request is not a string or array of strings |

Multi-issuer validation errors

Source: MultiIssuerValidationError, MultiIssuerEntityError

| Key | Source | Description |
| --- | --- | --- |
| multi_issuer.empty_token_array | MultiIssuerValidationError::EmptyTokenArray | The tokens array in the request is empty |
| multi_issuer.token_input_invalid | MultiIssuerValidationError::TokenInput | Token input has empty mapping string or empty payload |
| multi_issuer.all_tokens_failed | MultiIssuerValidationError::TokenValidationFailed | All tokens in the request failed validation, no valid tokens to evaluate |
| multi_issuer.invalid_context | MultiIssuerValidationError::InvalidContextJson | The optional context field is not valid JSON |
| multi_issuer.missing_issuer | MultiIssuerValidationError::MissingIssuer | A JWT is missing the iss claim, cannot determine trusted issuer |
| multi_issuer.entity_build | AuthorizeError::MultiIssuerEntity | Entity construction failed for multi-issuer flow (missing exp, invalid UID, no valid tokens, attribute build failure) |

Data store errors

Source: DataError

| Key | Source | Description |
| --- | --- | --- |
| data.invalid_key | DataError::InvalidKey | Push or get called with an empty key |
| data.key_not_found | DataError::KeyNotFound | Requested key does not exist in the data store |
| data.storage_limit | DataError::StorageLimitExceeded | max_entries reached, cannot push more data |
| data.ttl_exceeded | DataError::TTLExceeded | Requested TTL is larger than max_ttl configured |
| data.value_too_large | DataError::ValueTooLarge | Entry size exceeds max_entry_size |
| data.serialization | DataError::SerializationError | JSON serialization of the value failed |

Lock service transport errors

Source: LogWorker, LockService, RestTransport

| Key | Source | Description |
| --- | --- | --- |
| lock.send_failed | LogWorker::flush_logs retry path | Failed to send log batch to Lock Server (network error, 4xx/5xx response) |
| lock.channel_full | LockService::log_any try_send error | The mpsc channel to LogWorker is full, logs are being produced faster than sent |
| lock.malformed_entry | RestTransport::send_logs skip | Log entry could not be deserialized or mapped to Lock Server format |
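The lock.channel_full case can be illustrated with a bounded queue standing in for the mpsc channel. This is a Python analogue for illustration only: the module-level `log_queue`, the tiny capacity, and the Python `log_any` function are assumptions and do not mirror the actual LockService code.

```python
import queue
from collections import Counter

error_counters: Counter = Counter()
log_queue: queue.Queue = queue.Queue(maxsize=2)  # tiny bound, for illustration only


def log_any(entry: dict) -> bool:
    """Enqueue without blocking; count the drop when the buffer is full."""
    try:
        log_queue.put_nowait(entry)
        return True
    except queue.Full:
        # Logs are produced faster than they are sent: record it and move on
        # instead of blocking the caller.
        error_counters["lock.channel_full"] += 1
        return False
```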

operational_stats

Operational metrics collected during the telemetry interval. Values are either counters (reset each interval) or gauges (point-in-time snapshot at flush time).

Authorization decisions

| Key | Type | Description |
| --- | --- | --- |
| authz.requests_total | counter | Total number of authorization requests (authorize_unsigned + authorize_multi_issuer) |
| authz.requests_unsigned | counter | Number of authorize_unsigned calls |
| authz.requests_multi_issuer | counter | Number of authorize_multi_issuer calls |
| authz.decision_allow | counter | Number of requests that resulted in ALLOW |
| authz.decision_deny | counter | Number of requests that resulted in DENY |
| authz.errors_total | counter | Total number of requests that returned an error (did not reach a decision) |
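The definitions above imply two sums that a receiver could use as a sanity check. The first follows directly from the description of authz.requests_total. The second (allow + deny + errors = total) is an inference from the wording, not something the RFC states. The helper name `check_authz_totals` is hypothetical, and missing keys are assumed to mean zero.

```python
def check_authz_totals(stats: dict) -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    def get(key: str) -> int:
        # Assumption: an omitted counter is equivalent to zero.
        return stats.get(key, 0)

    problems = []
    total = get("authz.requests_total")
    if total != get("authz.requests_unsigned") + get("authz.requests_multi_issuer"):
        problems.append("requests_total != unsigned + multi_issuer")
    outcomes = get("authz.decision_allow") + get("authz.decision_deny") + get("authz.errors_total")
    if total != outcomes:
        problems.append("requests_total != allow + deny + errors")
    return problems
```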

Authorization latency

Latency is reported using percentiles instead of averages. Averages hide outliers: a single slow request can go unnoticed when averaged with thousands of fast ones. Percentiles describe the shape of the distribution: P50 shows typical latency, P95 shows the degraded experience of the slow tail, and P99 and max expose spikes and worst-case requests.

| Key | Type | Description |
| --- | --- | --- |
| authz.last_eval_time_us | gauge | Last policy evaluation time in microseconds |
| authz.eval_time_p50_us | gauge | Median (50th percentile) evaluation time in microseconds; typical request latency |
| authz.eval_time_p95_us | gauge | 95th percentile evaluation time; latency experienced by the slow tail |
| authz.eval_time_p99_us | gauge | 99th percentile evaluation time; worst-case excluding extreme outliers |
| authz.eval_time_max_us | gauge | Maximum evaluation time in the interval; detects spikes |
| authz.principals_per_request_avg | gauge | Average number of principals evaluated per unsigned request |
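The RFC does not say how the percentiles are computed. One simple option, shown here purely as an assumption, is the nearest-rank method over all evaluation times recorded in the interval. The names `nearest_rank` and `latency_gauges` are hypothetical, and a production implementation might prefer a histogram or sketch structure to avoid keeping every sample.

```python
def nearest_rank(sorted_samples: list, pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    n = len(sorted_samples)
    rank = (pct * n + 99) // 100  # integer ceil(pct * n / 100)
    return sorted_samples[max(rank, 1) - 1]


def latency_gauges(samples_us: list) -> dict:
    """Build the latency keys from evaluation times (microseconds, in arrival order)."""
    if not samples_us:
        return {}
    ordered = sorted(samples_us)
    return {
        "authz.last_eval_time_us": samples_us[-1],
        "authz.eval_time_p50_us": nearest_rank(ordered, 50),
        "authz.eval_time_p95_us": nearest_rank(ordered, 95),
        "authz.eval_time_p99_us": nearest_rank(ordered, 99),
        "authz.eval_time_max_us": ordered[-1],
    }
```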

Token cache

| Key | Type | Description |
| --- | --- | --- |
| token_cache.hits | counter | Number of cache hits (token reused without re-validation) |
| token_cache.misses | counter | Number of cache misses (full validation required) |
| token_cache.size | gauge | Current number of entries in the token cache |
| token_cache.evictions | counter | Number of entries evicted (TTL expiry or capacity) |
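A receiver will typically want the hit ratio rather than the raw counters. A minimal sketch, assuming that every lookup is counted as either a hit or a miss and that missing keys mean zero; `cache_hit_ratio` is a hypothetical helper name.

```python
from typing import Optional


def cache_hit_ratio(stats: dict) -> Optional[float]:
    """hits / (hits + misses), or None when there were no lookups."""
    hits = stats.get("token_cache.hits", 0)
    lookups = hits + stats.get("token_cache.misses", 0)
    return hits / lookups if lookups else None
```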

JWT validation

| Key | Type | Description |
| --- | --- | --- |
| jwt.validations_total | counter | Total number of individual JWT validations attempted |
| jwt.validations_success | counter | Number of JWTs that passed validation |
| jwt.validations_failed | counter | Number of JWTs that failed validation (sum should match individual error_counters jwt.* keys) |
| jwt.tokens_skipped_untrusted | counter | Tokens skipped in multi-issuer flow because issuer was not trusted (warning, not error) |
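The relationship noted for jwt.validations_failed can be checked on the receiving side as sketched below. The helper name `jwt_failures_consistent` is hypothetical, and the sketch assumes that every jwt.* key in error_counters counts toward the failed total, which the RFC implies but does not spell out for each variant.

```python
def jwt_failures_consistent(entry: dict) -> bool:
    """True when jwt.validations_failed equals the sum of jwt.* error counters."""
    reported = entry.get("operational_stats", {}).get("jwt.validations_failed", 0)
    classified = sum(
        count
        for key, count in entry.get("error_counters", {}).items()
        if key.startswith("jwt.")
    )
    return reported == classified
```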

Data store

| Key | Type | Description |
| --- | --- | --- |
| data.entries_count | gauge | Current number of entries in the pushed data store |
| data.total_size_bytes | gauge | Total memory used by data store entries |
| data.push_ops | counter | Number of push_data_ctx calls |
| data.get_ops | counter | Number of get_data_ctx calls |
| data.remove_ops | counter | Number of remove_data_ctx calls |
| data.ttl_expirations | counter | Number of entries that expired due to TTL |
| data.memory_alert_triggered | gauge | 1 if memory alert threshold was crossed during the interval, 0 otherwise |

Lock service transport

| Key | Type | Description |
| --- | --- | --- |
| lock.batches_sent | counter | Number of log batches successfully sent to Lock Server |
| lock.entries_sent | counter | Total number of log entries sent |
| lock.retries | counter | Number of retry attempts for failed sends |
| lock.queue_depth | gauge | Current number of unsent entries in the log buffer |

Instance

| Key | Type | Description |
| --- | --- | --- |
| instance.uptime_secs | gauge | Seconds since Cedarling instance was initialized |
| instance.memory_usage_bytes | gauge | Process memory usage in bytes (RSS) |
| instance.policy_count | gauge | Number of policies in the loaded policy store |
| instance.trusted_issuers_loaded | gauge | Number of successfully loaded trusted issuers |
| instance.trusted_issuers_failed | gauge | Number of trusted issuers that failed to load |