# Cedarling Telemetry RFC
This document defines the telemetry data that a Cedarling instance can collect and
send to the Lock Server via the /audit/telemetry/bulk endpoint. Telemetry is
organized into three maps within the TelemetryEntry message: policy_stats,
error_counters, and operational_stats.
All counters are reset to zero at the start of each telemetry interval. Gauge-type values (marked with gauge) reflect a point-in-time snapshot taken at flush time.
```proto
message TelemetryEntry {
  google.protobuf.Timestamp creation_date = 1;
  string service = 3;
  string node_name = 4;
  string status = 5;
  // fields 6-12 removed: migrated to operational_stats map for extensibility
  map<string, int64> policy_stats = 13;
  map<string, int64> error_counters = 14;
  map<string, int64> operational_stats = 15;
  int64 interval_secs = 16; // the collection period this entry covers
}
```

`creation_date` — Timestamp when the telemetry entry was created and sent. For
telemetry this represents the end of the collection window (the start can be derived
as `creation_date - interval_secs`). The previous `event_time` field has been removed
because for telemetry there is no single "event" — the entry covers a time range, not
a point in time. The `interval_secs` field together with `creation_date` fully defines
the collection window.
Note: The existing `LogEntry` and `HealthEntry` messages in `audit.proto` retain both `creation_date` and `event_time` for backwards compatibility. For those message types, both fields are currently set to the same value.
The interval_secs field indicates the duration (in seconds) of the collection window
that this TelemetryEntry covers. All counter values in error_counters,
operational_stats, and policy_stats represent totals accumulated over this period.
The receiver needs this value to compute rates (e.g., requests per second =
authz.requests_total / interval_secs).
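For example, a receiver-side sketch of reconstructing the collection window and deriving a rate (the entry literal below reuses values from the example payload later in this document):

```python
from datetime import datetime, timedelta

# A received TelemetryEntry, already parsed from JSON (values are illustrative).
entry = {
    "creation_date": "2026-04-02T10:05:00Z",
    "interval_secs": 60,
    "operational_stats": {"authz.requests_total": 1240},
}

# creation_date marks the END of the collection window; the start is derived
# by subtracting interval_secs.
window_end = datetime.fromisoformat(entry["creation_date"].replace("Z", "+00:00"))
window_start = window_end - timedelta(seconds=entry["interval_secs"])

# Rates come from dividing interval counters by the interval length.
requests_per_sec = (
    entry["operational_stats"]["authz.requests_total"] / entry["interval_secs"]
)
```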
The value comes from the CEDARLING_LOCK_TELEMETRY_INTERVAL bootstrap property, which
controls both the collection period and the send interval. Under normal operation,
interval_secs equals the configured interval. It may differ if the instance was just
started (first partial interval) or is shutting down (final flush before exit).
The existing CEDARLING_LOCK_TELEMETRY_INTERVAL property (in seconds, 0 = disabled)
currently controls only the send interval. With telemetry implemented, it also defines
the collection window — counters are reset after each send.
| Property | Type | Default | Description |
|---|---|---|---|
| `CEDARLING_LOCK_TELEMETRY_INTERVAL` | `u64` | `0` (disabled) | How often (in seconds) to collect, reset, and send telemetry to the Lock Server. Value of `0` disables telemetry entirely. Recommended: `60` |
Note: Currently there is no separate property for the collection period vs. the send interval — they are always equal. If a use case arises where finer-grained collection is needed (e.g., collect every 10s for more accurate percentiles but send every 60s), a new property `CEDARLING_LOCK_TELEMETRY_COLLECTION_INTERVAL` could be introduced. Until then, `CEDARLING_LOCK_TELEMETRY_INTERVAL` serves both purposes.
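The collect-reset-send cycle can be sketched as a counter registry whose flush atomically snapshots and zeroes all counters. This is an illustrative sketch, not the Cedarling implementation; the class and method names here are invented:

```python
import threading
from collections import defaultdict

class TelemetryCounters:
    """Accumulates counters for one interval; flush() returns and resets them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)

    def incr(self, key, by=1):
        with self._lock:
            self._counters[key] += by

    def flush(self):
        # Snapshot and reset under one lock so no increment is lost
        # or double-counted across interval boundaries.
        with self._lock:
            snapshot = dict(self._counters)
            self._counters.clear()
        return snapshot

counters = TelemetryCounters()
counters.incr("authz.requests_total")
counters.incr("authz.requests_total")
first = counters.flush()   # snapshot for this interval
second = counters.flush()  # empty: counters start from zero each interval
```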
The telemetry REST endpoint follows the same pattern as the existing log endpoint:
a POST with a JSON array body to the /bulk sub-path. The JSON field names match
the proto field names using snake_case (same convention as LogEntry / HealthEntry).
Request:

```http
POST /jans-auth/v1/audit/telemetry/bulk HTTP/1.1
Host: lock.example.com
Authorization: Bearer <access_token>
Content-Type: application/json

[
  {
    "creation_date": "2026-04-02T10:05:00Z",
    "service": "my_app",
    "node_name": "019615a1-d7c0-7fc0-8000-deadbeef1234",
    "status": "running",
    "interval_secs": 60,
    "policy_stats": {
      "allow_read_documents": 340,
      "deny_admin_access": 12,
      "allow_user_update": 88
    },
    "error_counters": {
      "jwt.validation_failed": 8,
      "jwt.untrusted_issuer": 2,
      "authz.entity_build": 3,
      "data.storage_limit": 1
    },
    "operational_stats": {
      "authz.requests_total": 1240,
      "authz.requests_unsigned": 200,
      "authz.requests_multi_issuer": 1040,
      "authz.decision_allow": 1218,
      "authz.decision_deny": 8,
      "authz.errors_total": 14,
      "authz.last_eval_time_us": 38,
      "authz.eval_time_p50_us": 35,
      "authz.eval_time_p95_us": 120,
      "authz.eval_time_p99_us": 450,
      "authz.eval_time_max_us": 1200,
      "authz.principals_per_request_avg": 2,
      "token_cache.hits": 890,
      "token_cache.misses": 350,
      "token_cache.size": 128,
      "token_cache.evictions": 12,
      "jwt.validations_total": 1240,
      "jwt.validations_success": 1227,
      "jwt.validations_failed": 13,
      "jwt.tokens_skipped_untrusted": 2,
      "data.entries_count": 45,
      "data.total_size_bytes": 12800,
      "data.push_ops": 30,
      "data.get_ops": 1040,
      "data.remove_ops": 5,
      "data.ttl_expirations": 8,
      "data.memory_alert_triggered": 0,
      "lock.batches_sent": 6,
      "lock.entries_sent": 1240,
      "lock.retries": 0,
      "lock.queue_depth": 0,
      "instance.uptime_secs": 3600,
      "instance.memory_usage_bytes": 52428800,
      "instance.policy_count": 15,
      "instance.trusted_issuers_loaded": 3,
      "instance.trusted_issuers_failed": 0
    }
  }
]
```

Response (success):

```json
{
  "success": true,
  "message": ""
}
```

Notes:
- The endpoint URL is derived from the Lock Server's `.well-known/lock-server-configuration` response (`audit.telemetry_endpoint`), with `/bulk` appended automatically — same as for the log endpoint.
- The `Authorization` header uses the same Bearer token obtained during Dynamic Client Registration (scope: `https://jans.io/oauth/lock/telemetry.write`).
- The body is always a JSON array (even with a single entry) to match the `BulkTelemetryRequest` proto pattern.
- Empty maps may be omitted or sent as `{}`. Zero-value counters may be omitted to reduce payload size.
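Putting the notes above together, a client-side sketch of assembling the bulk request. Only the body shape and the `/bulk` suffix are defined by this RFC; the function name, endpoint, and token values below are illustrative placeholders:

```python
import json

def build_bulk_telemetry_request(entries, base_endpoint, access_token):
    """Build the POST to the /bulk sub-path. `base_endpoint` would come from
    audit.telemetry_endpoint in .well-known/lock-server-configuration."""
    url = base_endpoint.rstrip("/") + "/bulk"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    # The body is always a JSON array, even for a single entry.
    body = json.dumps(entries)
    return url, headers, body

url, headers, body = build_bulk_telemetry_request(
    [{"service": "my_app", "interval_secs": 60, "operational_stats": {}}],
    "https://lock.example.com/jans-auth/v1/audit/telemetry",
    "<access_token>",
)
```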
The following dedicated fields (6-12) have been replaced by keys in the
operational_stats map. This makes the protocol extensible — adding a new metric
requires only a new map key, not a proto change.
| Old field | Old number | New location | Reason |
|---|---|---|---|
| `last_policy_load_size` | 6 | `operational_stats["instance.policy_count"]` | Gauge metric, fits naturally in the map |
| `policy_success_load_counter` | 7 | `operational_stats["instance.trusted_issuers_loaded"]` | Duplicate concept, consolidated |
| `policy_failed_load_counter` | 8 | `operational_stats["instance.trusted_issuers_failed"]` | Duplicate concept, consolidated |
| `last_policy_evaluation_time_ns` | 9 | `operational_stats["authz.last_eval_time_us"]` | Moved to latency section (units changed to microseconds) |
| `avg_policy_evaluation_time_ns` | 10 | Removed — replaced by percentiles | Averages hide outliers. Replaced by `authz.eval_time_p50_us`, `authz.eval_time_p95_us`, `authz.eval_time_p99_us` |
| `memory_usage` | 11 | `operational_stats["instance.memory_usage_bytes"]` | Gauge metric, fits naturally in the map |
| `evaluation_requests_count` | 12 | `operational_stats["authz.requests_total"]` | Counter metric, consolidated with other authz stats |
Per-policy evaluation counts. Each key is a policy ID from the policy store, and the
value is the number of times that policy was referenced in a Cedar reason (i.e.,
contributed to an authorization decision) during the interval.
| Key pattern | Value | Description |
|---|---|---|
| `<policy_id>` | count | Number of times this policy appeared in the Cedar diagnostics reason set during the interval |

Example:

```json
{
  "allow_read_documents": 340,
  "deny_admin_access": 12,
  "allow_user_update": 88
}
```

Classification counters for errors encountered during the telemetry interval. Each key identifies a specific error variant and the value is the number of occurrences.
Source: `ValidateJwtError`, `JwtProcessingError`

| Key | Source | Description |
|---|---|---|
| `jwt.decode_failed` | `ValidateJwtError::DecodeJwt` | JWT is malformed: not in `header.payload.signature` format, base64 decode failure, or header/claims JSON deserialization failure |
| `jwt.missing_key` | `ValidateJwtError::MissingValidationKey` | No decoding key available for this JWT's `kid`. JWKS may not have been fetched or the key was rotated |
| `jwt.missing_validator` | `ValidateJwtError::MissingValidator` | No validator initialized for this issuer+algorithm combination. Either the issuer is untrusted or the algorithm is not in `signature_algorithms_supported` |
| `jwt.validation_failed` | `ValidateJwtError::ValidateJwt` | Signature verification or standard claim validation failed (expired, wrong audience, `nbf` in future, etc.) |
| `jwt.missing_claims` | `ValidateJwtError::MissingClaims` | Token is missing required claims defined per token type in the trusted issuer config (e.g. `sub`, `iss`, `aud`) |
| `jwt.status_check_failed` | `ValidateJwtError::GetJwtStatus` | Failed to fetch or parse the JWT status list reference |
| `jwt.status_rejected` | `ValidateJwtError::RejectJwtStatus` | Token was revoked or suspended according to the IETF status list |
| `jwt.missing_status_list` | `ValidateJwtError::MissingStatusList` | Status validation is enabled but no status list is available for this token |
| `jwt.untrusted_issuer` | `ValidateJwtError::TrustedIssuerValidation(UntrustedIssuer)` | Token's `iss` claim does not match any trusted issuer in the policy store |
| `jwt.missing_required_claim` | `ValidateJwtError::TrustedIssuerValidation(MissingRequiredClaim)` | Trusted issuer requires a specific claim that the token does not have |
| `jwt.signed_authz_unavailable` | `JwtProcessingError::SignedAuthzUnavailable` | Signed authorization requested but no trusted issuers or JWKS were configured |
Source: `AuthorizeError`

| Key | Source | Description |
|---|---|---|
| `authz.invalid_action` | `AuthorizeError::Action` | The action string could not be parsed as a valid Cedar `EntityUid` |
| `authz.identifier_parsing` | `AuthorizeError::IdentifierParsing` | An entity type name or action identifier failed to parse |
| `authz.invalid_context` | `AuthorizeError::CreateContext` | Context JSON does not conform to the Cedar schema for this action |
| `authz.invalid_principal` | `AuthorizeError::InvalidPrincipal` | The Cedar request for a principal does not conform to the schema |
| `authz.request_validation` | `AuthorizeError::RequestValidation` | Cedar request validation failed (schema mismatch between action, principal, resource) |
| `authz.entity_validation` | `AuthorizeError::ValidateEntities` | Built entities violate the Cedar schema constraints |
| `authz.entity_build` | `AuthorizeError::BuildEntity` | Failed to construct a Cedar entity: missing entity ID, invalid UID format, attribute evaluation failure, or entity type not in schema |
| `authz.context_build` | `AuthorizeError::BuildContext` | Context merging failed: key conflict between request context and pushed data, unknown action in schema, or missing entity reference |
| `authz.rule_execution` | `AuthorizeError::ExecuteRule` | Failed to apply `principal_bool_operator` aggregation rule across principals |
| `authz.unsigned_role_build` | `AuthorizeError::BuildUnsignedRoleEntity` | Role field in unsigned request is not a string or array of strings |
Source: `MultiIssuerValidationError`, `MultiIssuerEntityError`

| Key | Source | Description |
|---|---|---|
| `multi_issuer.empty_token_array` | `MultiIssuerValidationError::EmptyTokenArray` | The `tokens` array in the request is empty |
| `multi_issuer.token_input_invalid` | `MultiIssuerValidationError::TokenInput` | Token input has empty mapping string or empty payload |
| `multi_issuer.all_tokens_failed` | `MultiIssuerValidationError::TokenValidationFailed` | All tokens in the request failed validation, no valid tokens to evaluate |
| `multi_issuer.invalid_context` | `MultiIssuerValidationError::InvalidContextJson` | The optional `context` field is not valid JSON |
| `multi_issuer.missing_issuer` | `MultiIssuerValidationError::MissingIssuer` | A JWT is missing the `iss` claim, cannot determine trusted issuer |
| `multi_issuer.entity_build` | `AuthorizeError::MultiIssuerEntity` | Entity construction failed for multi-issuer flow (missing `exp`, invalid UID, no valid tokens, attribute build failure) |
Source: `DataError`

| Key | Source | Description |
|---|---|---|
| `data.invalid_key` | `DataError::InvalidKey` | Push or get called with an empty key |
| `data.key_not_found` | `DataError::KeyNotFound` | Requested key does not exist in the data store |
| `data.storage_limit` | `DataError::StorageLimitExceeded` | `max_entries` reached, cannot push more data |
| `data.ttl_exceeded` | `DataError::TTLExceeded` | Requested TTL is larger than the configured `max_ttl` |
| `data.value_too_large` | `DataError::ValueTooLarge` | Entry size exceeds `max_entry_size` |
| `data.serialization` | `DataError::SerializationError` | JSON serialization of the value failed |
Source: `LogWorker`, `RestTransport`

| Key | Source | Description |
|---|---|---|
| `lock.send_failed` | `LogWorker::flush_logs` retry path | Failed to send log batch to Lock Server (network error, 4xx/5xx response) |
| `lock.channel_full` | `LockService::log_any` `try_send` error | The mpsc channel to `LogWorker` is full, logs are being produced faster than sent |
| `lock.malformed_entry` | `RestTransport::send_logs` skip | Log entry could not be deserialized or mapped to Lock Server format |
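Incrementing `error_counters` amounts to mapping each error variant to its dotted key. The Rust error types above are real, but the mapping table and function below are a purely illustrative sketch of that classification step:

```python
# Hypothetical mapping from error variant names to counter keys, following
# the tables above (illustrative subset, not the actual Cedarling types).
ERROR_KEYS = {
    "ValidateJwtError::DecodeJwt": "jwt.decode_failed",
    "ValidateJwtError::ValidateJwt": "jwt.validation_failed",
    "AuthorizeError::BuildEntity": "authz.entity_build",
    "DataError::StorageLimitExceeded": "data.storage_limit",
}

def record_error(counters, variant):
    """Classify one error occurrence into its error_counters key."""
    key = ERROR_KEYS.get(variant)
    if key is not None:
        counters[key] = counters.get(key, 0) + 1

counters = {}
record_error(counters, "ValidateJwtError::ValidateJwt")
record_error(counters, "ValidateJwtError::ValidateJwt")
record_error(counters, "DataError::StorageLimitExceeded")
```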
Operational metrics collected during the telemetry interval. Values are either counters (reset each interval) or gauges (point-in-time snapshot at flush time).
| Key | Type | Description |
|---|---|---|
| `authz.requests_total` | counter | Total number of authorization requests (`authorize_unsigned` + `authorize_multi_issuer`) |
| `authz.requests_unsigned` | counter | Number of `authorize_unsigned` calls |
| `authz.requests_multi_issuer` | counter | Number of `authorize_multi_issuer` calls |
| `authz.decision_allow` | counter | Number of requests that resulted in ALLOW |
| `authz.decision_deny` | counter | Number of requests that resulted in DENY |
| `authz.errors_total` | counter | Total number of requests that returned an error (did not reach a decision) |
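Since each request either reaches a decision or errors out, these counters suggest an accounting check a receiver might apply: allow + deny + errors should equal the total. This is an assumption derived from the counter definitions, not a guarantee stated by this RFC; the figures below are from the example payload (1218 + 8 + 14 = 1240):

```python
stats = {
    "authz.requests_total": 1240,
    "authz.decision_allow": 1218,
    "authz.decision_deny": 8,
    "authz.errors_total": 14,
}

# Every request ends in ALLOW, DENY, or an error, so the parts
# should account for the total.
accounted = (
    stats["authz.decision_allow"]
    + stats["authz.decision_deny"]
    + stats["authz.errors_total"]
)
```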
Latency is reported using percentiles instead of averages. Averages hide outliers — a single slow request can go unnoticed if averaged with thousands of fast ones. Percentiles give a full distribution picture: P50 shows typical latency, P95 shows degraded experience for tail users, P99/max detect spikes and worst-case scenarios.
| Key | Type | Description |
|---|---|---|
| `authz.last_eval_time_us` | gauge | Last policy evaluation time in microseconds |
| `authz.eval_time_p50_us` | gauge | Median (50th percentile) evaluation time in microseconds — typical request latency |
| `authz.eval_time_p95_us` | gauge | 95th percentile evaluation time — latency experienced by the slow tail |
| `authz.eval_time_p99_us` | gauge | 99th percentile evaluation time — worst-case excluding extreme outliers |
| `authz.eval_time_max_us` | gauge | Maximum evaluation time in the interval — detects spikes |
| `authz.principals_per_request_avg` | gauge | Average number of principals evaluated per unsigned request |
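The percentile gauges can be computed at flush time from the evaluation times recorded during the interval. A minimal sketch using nearest-rank percentiles; the real implementation may well use a histogram or reservoir sample instead, and the sample values here are illustrative:

```python
def percentile(sorted_samples, p):
    """Nearest-rank percentile over a non-empty sorted list (p in 0..100)."""
    # ceil(n * p / 100) - 1, computed with integer arithmetic
    idx = max(0, -(-len(sorted_samples) * p // 100) - 1)
    return sorted_samples[idx]

# Evaluation times (microseconds) recorded during one interval.
samples = sorted([35, 30, 40, 120, 38, 33, 36, 450, 34, 37])

stats = {
    "authz.eval_time_p50_us": percentile(samples, 50),
    "authz.eval_time_p95_us": percentile(samples, 95),
    "authz.eval_time_p99_us": percentile(samples, 99),
    "authz.eval_time_max_us": samples[-1],
}
```

With only ten samples the p95 and p99 collapse onto the maximum; over a full interval of traffic the percentiles spread out as in the example payload.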
| Key | Type | Description |
|---|---|---|
| `token_cache.hits` | counter | Number of cache hits (token reused without re-validation) |
| `token_cache.misses` | counter | Number of cache misses (full validation required) |
| `token_cache.size` | gauge | Current number of entries in the token cache |
| `token_cache.evictions` | counter | Number of entries evicted (TTL expiry or capacity) |
| Key | Type | Description |
|---|---|---|
| `jwt.validations_total` | counter | Total number of individual JWT validations attempted |
| `jwt.validations_success` | counter | Number of JWTs that passed validation |
| `jwt.validations_failed` | counter | Number of JWTs that failed validation (the sum should match the individual `jwt.*` keys in `error_counters`) |
| `jwt.tokens_skipped_untrusted` | counter | Tokens skipped in multi-issuer flow because the issuer was not trusted (warning, not error) |
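Since `jwt.validations_failed` is expected to reconcile with the `jwt.*` keys in `error_counters`, a receiver-side consistency check might look like the following sketch (the function name and the entry values are illustrative):

```python
def jwt_failures_consistent(entry):
    """True if jwt.validations_failed equals the sum of jwt.* error counters."""
    failed = entry["operational_stats"].get("jwt.validations_failed", 0)
    jwt_error_sum = sum(
        count
        for key, count in entry["error_counters"].items()
        if key.startswith("jwt.")
    )
    return failed == jwt_error_sum

entry = {
    "operational_stats": {"jwt.validations_failed": 10},
    "error_counters": {"jwt.validation_failed": 8, "jwt.untrusted_issuer": 2},
}
```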
| Key | Type | Description |
|---|---|---|
| `data.entries_count` | gauge | Current number of entries in the pushed data store |
| `data.total_size_bytes` | gauge | Total memory used by data store entries |
| `data.push_ops` | counter | Number of `push_data_ctx` calls |
| `data.get_ops` | counter | Number of `get_data_ctx` calls |
| `data.remove_ops` | counter | Number of `remove_data_ctx` calls |
| `data.ttl_expirations` | counter | Number of entries that expired due to TTL |
| `data.memory_alert_triggered` | gauge | 1 if the memory alert threshold was crossed during the interval, 0 otherwise |
| Key | Type | Description |
|---|---|---|
| `lock.batches_sent` | counter | Number of log batches successfully sent to the Lock Server |
| `lock.entries_sent` | counter | Total number of log entries sent |
| `lock.retries` | counter | Number of retry attempts for failed sends |
| `lock.queue_depth` | gauge | Current number of unsent entries in the log buffer |
| Key | Type | Description |
|---|---|---|
| `instance.uptime_secs` | gauge | Seconds since the Cedarling instance was initialized |
| `instance.memory_usage_bytes` | gauge | Process memory usage in bytes (RSS) |
| `instance.policy_count` | gauge | Number of policies in the loaded policy store |
| `instance.trusted_issuers_loaded` | gauge | Number of successfully loaded trusted issuers |
| `instance.trusted_issuers_failed` | gauge | Number of trusted issuers that failed to load |