Goal
Instrument hensu-server end-to-end so operators can answer three questions without attaching a debugger: which workflow is slow, which provider is burning tokens, and where did this execution backtrack.
The server currently emits zero application-level telemetry.
Scope
Add four Quarkus extensions to hensu-server only (hensu-core stays dependency-free):
quarkus-micrometer-registry-prometheus — metrics export at /q/metrics.
quarkus-opentelemetry — trace export (OTLP, configurable endpoint).
quarkus-smallrye-context-propagation — MDC + OTel context across virtual threads and Mutiny boundaries.
quarkus-logging-json — structured log output.
Verify clean GraalVM native build (./gradlew :hensu-server:build -Dquarkus.native.enabled=true) before merging. No reflect-config.json edits expected; if any extension demands them, document the entry.
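As a hedged illustration of the "configurable endpoint" requirement, the fragment below shows one way the OTLP endpoint could be set in application.properties. It assumes Quarkus 3.x property names and a collector listening on localhost:4317. The fallback URL is a placeholder, not something this task prescribes.

```properties
# Sketch only: take the endpoint from the standard OTel environment variable,
# falling back to an assumed local collector when it is not set.
quarkus.otel.exporter.otlp.endpoint=${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}
```

With this shape the deployment environment supplies the sink and the binary carries only a development default.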
Metrics to emit
Workflow-level:
hensu.workflow.active — gauge, tags: workflow_name.
hensu.workflow.duration — timer, tags: workflow_name, status.
hensu.workflow.backtracks.total — counter, tags: workflow_name, node_id.
Node / agent-level:
hensu.node.latency — timer, tags: node_type, provider, model.
hensu.node.errors — counter, tags: provider, error_type (e.g. rate_limit, timeout, parse_error).
Cost:
hensu.llm.tokens.prompt — counter, tags: provider, model, workflow_name.
hensu.llm.tokens.completion — counter, tags: provider, model, workflow_name.
State / persistence:
hensu.state.snapshot.latency — timer (JDBC repo write path).
Concurrency:
hensu.workflow.branches.active — gauge, fork/join fan-out saturation.
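Because hensu-core stays dependency-free, the metrics above have to be fed through some seam that does not mention Micrometer. The sketch below shows one possible shape: a listener port in core and an adapter in hensu-server. ExecutionListener, InMemoryMetrics, InstrumentedRun, the key format, and the "success"/"failure" status values are all hypothetical names chosen for illustration. The in-memory maps stand in for a real meter registry.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical port owned by hensu-core; it mentions no telemetry library types. */
interface ExecutionListener {
    void workflowStarted(String workflowName);
    void workflowFinished(String workflowName, String status, Duration elapsed);
    void backtracked(String workflowName, String nodeId);
}

/** In-memory stand-in for the hensu-server adapter that would delegate to a meter registry. */
final class InMemoryMetrics implements ExecutionListener {
    private final Map<String, AtomicInteger> active = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    @Override public void workflowStarted(String workflowName) {
        // hensu.workflow.active: one gauge value per workflow_name.
        active.computeIfAbsent(workflowName, k -> new AtomicInteger()).incrementAndGet();
    }

    @Override public void workflowFinished(String workflowName, String status, Duration elapsed) {
        active.computeIfAbsent(workflowName, k -> new AtomicInteger()).decrementAndGet();
        // hensu.workflow.duration: keyed by (workflow_name, status); keep count and total time.
        String key = "hensu.workflow.duration|" + workflowName + "|" + status;
        add(key + "|count", 1);
        add(key + "|nanos", elapsed.toNanos());
    }

    @Override public void backtracked(String workflowName, String nodeId) {
        // hensu.workflow.backtracks.total: keyed by (workflow_name, node_id).
        add("hensu.workflow.backtracks.total|" + workflowName + "|" + nodeId, 1);
    }

    int activeCount(String workflowName) {
        AtomicInteger n = active.get(workflowName);
        return n == null ? 0 : n.get();
    }

    long count(String key) {
        LongAdder a = counters.get(key);
        return a == null ? 0 : a.sum();
    }

    private void add(String key, long delta) {
        counters.computeIfAbsent(key, k -> new LongAdder()).add(delta);
    }
}

/** Wraps one execution so the active gauge is released even when the body throws. */
final class InstrumentedRun {
    static void run(ExecutionListener listener, String workflowName, Runnable body) {
        long start = System.nanoTime();
        String status = "success"; // assumed status tag values
        listener.workflowStarted(workflowName);
        try {
            body.run();
        } catch (RuntimeException e) {
            status = "failure";
            throw e;
        } finally {
            listener.workflowFinished(workflowName, status,
                    Duration.ofNanos(System.nanoTime() - start));
        }
    }
}

final class ListenerDemo {
    public static void main(String[] args) {
        InMemoryMetrics metrics = new InMemoryMetrics();
        InstrumentedRun.run(metrics, "demo", () -> metrics.backtracked("demo", "review"));
        System.out.println(metrics.activeCount("demo")); // 0: the gauge is released in finally
        System.out.println(metrics.count("hensu.workflow.backtracks.total|demo|review")); // 1
    }
}
```

The point of the try/finally is that hensu.workflow.active returns to its previous value on failure as well as success. Without it, a gauge that only decrements on the happy path drifts upward over time.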
Spans to emit
WorkflowExecution (root, attrs: workflow_id, workflow_name, execution_id).
SubWorkflowExecution (child).
NodeExecution (child, attrs: node_id, node_type, provider, model).
StateSnapshot (child, attrs: repository, snapshot_size_bytes).
Log correlation
Inject MDC keys at the WorkflowExecutor entry point: workflow_id, execution_id, node_id (when in a node scope). trace_id and span_id come from quarkus-opentelemetry automatically. Honor the ThreadLocal ban: rely on Quarkus context propagation rather than hand-rolling MDC.
Non-goals
- Custom dashboards, Grafana JSON, alert rules — out of scope; covered by a follow-up ops task.
- Distributed tracing across multiple Hensu binaries — single-binary engine for now.
- Self-hosted collector setup — emit OTLP, let the deployment environment provide the sink.
- Legacy JVM / HTTP metric curation — Micrometer defaults are fine.
Acceptance
- Native build (-Dquarkus.native.enabled=true) passes locally.
- Running a sample workflow against a live OpenAI key produces non-zero values for every metric in the list above.
- A traced execution shows the full WorkflowExecution → NodeExecution → StateSnapshot waterfall in any OTLP-compatible viewer (Jaeger, Tempo, Honeycomb).
- Log lines emitted from inside a node carry workflow_id, execution_id, node_id, and trace_id even when execution crosses virtual-thread boundaries.
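The "non-zero values for every metric" criterion can be checked mechanically against the text served at /q/metrics. The sketch below is one possible helper, not part of the task itself: MetricsCheck and its method names are hypothetical, and it assumes samples carry no trailing timestamp. The expected names in the demo follow the usual Micrometer Prometheus naming (dots become underscores, counters end in _total, timers expose _seconds_count).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sums sample values per metric name from Prometheus text exposition output. */
final class MetricsCheck {

    /** Returns metric name -> sum of all its samples across label sets. */
    static Map<String, Double> sumByName(String exposition) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (String raw : exposition.split("\n")) {
            String line = raw.strip();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            int lastSpace = line.lastIndexOf(' ');
            if (lastSpace < 0) {
                continue; // no value column, not a sample line
            }
            int brace = line.indexOf('{');
            int space = line.indexOf(' ');
            // The name ends at the label block if there is one, otherwise at the first space.
            int nameEnd = (brace >= 0 && brace < space) ? brace : space;
            if (nameEnd <= 0) {
                continue;
            }
            String name = line.substring(0, nameEnd);
            try {
                // Assumes no trailing timestamp, so the last token is the sample value.
                double value = Double.parseDouble(line.substring(lastSpace + 1));
                sums.merge(name, value, Double::sum);
            } catch (NumberFormatException e) {
                // Ignore values this sketch does not parse (for example "+Inf").
            }
        }
        return sums;
    }

    /** Returns the expected names that are absent or whose summed value is not positive. */
    static List<String> missingOrZero(String exposition, List<String> expected) {
        Map<String, Double> sums = sumByName(exposition);
        return expected.stream()
                .filter(name -> !(sums.getOrDefault(name, 0.0) > 0.0)) // also flags NaN
                .toList();
    }
}

final class MetricsCheckDemo {
    public static void main(String[] args) {
        String sample = """
                # TYPE hensu_workflow_active gauge
                hensu_workflow_active{workflow_name="demo"} 1.0
                hensu_llm_tokens_prompt_total{provider="openai",model="m",workflow_name="demo"} 412.0
                hensu_node_errors_total{provider="openai",error_type="timeout"} 0.0
                """;
        List<String> expected = List.of(
                "hensu_workflow_active",
                "hensu_llm_tokens_prompt_total",
                "hensu_node_errors_total",
                "hensu_workflow_duration_seconds_count");
        // Prints [hensu_node_errors_total, hensu_workflow_duration_seconds_count]
        System.out.println(MetricsCheck.missingOrZero(sample, expected));
    }
}
```

In practice the exposition string would come from an HTTP GET against the running server. A check like this makes it obvious that error and backtrack counters stay at zero unless the sample workflow deliberately triggers them.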