[Feature] Server observability — metrics, tracing, structured logs #76

Description

@alxsuv

Goal

Instrument hensu-server end-to-end so operators can answer three questions without attaching a debugger: which workflow is slow, which provider is burning tokens, and where did this execution backtrack.
The server currently emits zero application-level telemetry.

Scope

Add four Quarkus extensions to hensu-server only (hensu-core stays dependency-free):

  • quarkus-micrometer-registry-prometheus — metrics export at /q/metrics.
  • quarkus-opentelemetry — trace export (OTLP, configurable endpoint).
  • quarkus-smallrye-context-propagation — MDC + OTel context across virtual threads and Mutiny boundaries.
  • quarkus-logging-json — structured log output.

Verify clean GraalVM native build (./gradlew :hensu-server:build -Dquarkus.native.enabled=true) before merging. No reflect-config.json edits expected; if any extension demands them, document the entry.

Metrics to emit

Workflow-level:

  • hensu.workflow.active — gauge, tags: workflow_name.
  • hensu.workflow.duration — timer, tags: workflow_name, status.
  • hensu.workflow.backtracks.total — counter, tags: workflow_name, node_id.
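
The active gauge and the duration timer above share one lifecycle: increment on entry, decrement on exit, and record elapsed time with a status on both the success and failure paths. A minimal JDK-only sketch of that bookkeeping follows. The names `ActiveWorkflowTracker`, `Outcome`, and the status values `success` / `failure` are assumptions for illustration, not existing hensu-server code. In the real server the counter would presumably be read by a Micrometer gauge and the elapsed time recorded by a Micrometer timer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: tracks in-flight executions per workflow_name. */
final class ActiveWorkflowTracker {

    /** Result of the most recent execution, kept only so the sketch is inspectable. */
    record Outcome(String workflowName, String status, long elapsedNanos) {}

    private final Map<String, AtomicInteger> active = new ConcurrentHashMap<>();
    private volatile Outcome last;

    /** Runs the body while counting it as active; records status and elapsed time on exit. */
    <T> T track(String workflowName, Supplier<T> body) {
        AtomicInteger inFlight = active.computeIfAbsent(workflowName, k -> new AtomicInteger());
        inFlight.incrementAndGet();
        long start = System.nanoTime();
        String status = "failure"; // assumed tag value; overwritten only if the body returns
        try {
            T result = body.get();
            status = "success"; // assumed tag value
            return result;
        } finally {
            // Decrement in finally so a thrown exception cannot leak an "active" execution.
            inFlight.decrementAndGet();
            last = new Outcome(workflowName, status, System.nanoTime() - start);
        }
    }

    /** Value a gauge tagged with workflow_name would report. */
    int activeCount(String workflowName) {
        AtomicInteger inFlight = active.get(workflowName);
        return inFlight == null ? 0 : inFlight.get();
    }

    Outcome lastOutcome() {
        return last;
    }
}
```

Decrementing in `finally` matters here: without it, a failed execution would leave hensu.workflow.active permanently inflated.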

Node / agent-level:

  • hensu.node.latency — timer, tags: node_type, provider, model.
  • hensu.node.errors — counter, tags: provider, error_type (e.g. rate_limit, timeout, parse_error).
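
The error_type tag on hensu.node.errors is only useful if it takes a small, fixed set of values; tagging with raw exception messages would create an unbounded number of time series. One way to keep it bounded is sketched below. The `ErrorType` enum, the `unknown` fallback value, the `ProviderHttpException` class, and the specific exception-to-tag mapping are all assumptions for illustration. Only the three values rate_limit, timeout, and parse_error come from the list above.

```java
import java.net.SocketTimeoutException;
import java.text.ParseException;
import java.util.concurrent.TimeoutException;

/** Closed set of error_type tag values; "unknown" is an assumed fallback. */
enum ErrorType {
    RATE_LIMIT("rate_limit"),
    TIMEOUT("timeout"),
    PARSE_ERROR("parse_error"),
    UNKNOWN("unknown");

    private final String tagValue;

    ErrorType(String tagValue) {
        this.tagValue = tagValue;
    }

    String tagValue() {
        return tagValue;
    }
}

/** Hypothetical exception carrying the HTTP status returned by an LLM provider. */
final class ProviderHttpException extends RuntimeException {
    final int status;

    ProviderHttpException(int status) {
        super("provider returned HTTP " + status);
        this.status = status;
    }
}

final class ErrorClassifier {
    private static final int MAX_CAUSE_DEPTH = 10; // guards against cyclic cause chains

    /** Walks the cause chain and maps the first recognised exception to a tag value. */
    static ErrorType classify(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof ProviderHttpException http && http.status == 429) {
                return ErrorType.RATE_LIMIT; // HTTP 429 Too Many Requests
            }
            if (current instanceof TimeoutException || current instanceof SocketTimeoutException) {
                return ErrorType.TIMEOUT;
            }
            if (current instanceof ParseException) {
                return ErrorType.PARSE_ERROR;
            }
            current = current.getCause();
        }
        return ErrorType.UNKNOWN;
    }
}
```

The counter would then be incremented with `ErrorClassifier.classify(e).tagValue()` as the error_type tag, so new failure modes show up as `unknown` rather than as new series.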

Cost:

  • hensu.llm.tokens.prompt — counter, tags: provider, model, workflow_name.
  • hensu.llm.tokens.completion — counter, tags: provider, model, workflow_name.
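
To make the tag combination concrete, the sketch below accumulates prompt and completion tokens per (provider, model, workflow_name) and answers the "which provider is burning tokens" question by summing over one dimension. It uses only the JDK. `TokenKey` and `TokenLedger` are hypothetical names, and in the server the same increments would presumably go to two Micrometer counters instead of in-memory maps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical key mirroring the tags on the two token counters. */
record TokenKey(String provider, String model, String workflowName) {}

/** Hypothetical in-memory stand-in for the prompt and completion counters. */
final class TokenLedger {
    private final Map<TokenKey, LongAdder> prompt = new ConcurrentHashMap<>();
    private final Map<TokenKey, LongAdder> completion = new ConcurrentHashMap<>();

    /** Adds the token usage reported for one LLM call. */
    void record(TokenKey key, long promptTokens, long completionTokens) {
        prompt.computeIfAbsent(key, k -> new LongAdder()).add(promptTokens);
        completion.computeIfAbsent(key, k -> new LongAdder()).add(completionTokens);
    }

    long promptTokens(TokenKey key) {
        LongAdder adder = prompt.get(key);
        return adder == null ? 0L : adder.sum();
    }

    long completionTokens(TokenKey key) {
        LongAdder adder = completion.get(key);
        return adder == null ? 0L : adder.sum();
    }

    /** Total tokens (prompt plus completion) across all models and workflows of one provider. */
    long totalForProvider(String provider) {
        return sumFor(prompt, provider) + sumFor(completion, provider);
    }

    private static long sumFor(Map<TokenKey, LongAdder> counters, String provider) {
        return counters.entrySet().stream()
                .filter(entry -> entry.getKey().provider().equals(provider))
                .mapToLong(entry -> entry.getValue().sum())
                .sum();
    }
}
```

Keeping prompt and completion separate, as the metric list does, matters because providers commonly price the two differently.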

State / persistence:

  • hensu.state.snapshot.latency — timer (JDBC repo write path).

Concurrency:

  • hensu.workflow.branches.active — gauge, fork/join fan-out saturation.

Spans to emit

  • WorkflowExecution (root, attrs: workflow_id, workflow_name, execution_id).
  • SubWorkflowExecution (child).
  • NodeExecution (child, attrs: node_id, node_type, provider, model).
  • StateSnapshot (child, attrs: repository, snapshot_size_bytes).

Log correlation

Inject MDC keys at the WorkflowExecutor entry point: workflow_id, execution_id, node_id (when in a node scope). trace_id and span_id come from quarkus-opentelemetry automatically. Honor the ThreadLocal ban: rely on Quarkus context propagation rather than hand-rolling MDC.

Non-goals

  • Custom dashboards, Grafana JSON, alert rules — out of scope; covered by a follow-up ops task.
  • Distributed tracing across multiple Hensu binaries — single-binary engine for now.
  • Self-hosted collector setup — emit OTLP, let the deployment environment provide the sink.
  • Legacy JVM / HTTP metric curation — Micrometer defaults are fine.

Acceptance

  • Native build (-Dquarkus.native.enabled=true) passes locally.
  • Running a sample workflow against a live OpenAI key produces non-zero values for every metric in the list above.
  • A traced execution shows the full WorkflowExecution → NodeExecution → StateSnapshot waterfall in any OTLP-compatible viewer (Jaeger, Tempo, Honeycomb).
  • Log lines emitted from inside a node carry workflow_id, execution_id, node_id, and trace_id even when execution crosses virtual-thread boundaries.
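
The non-zero-metrics criterion can be checked mechanically against a scrape of /q/metrics. The sketch below is one possible helper, not part of the issue: `ScrapeCheck` is a hypothetical name, and it assumes the Prometheus registry exposes dotted meter names with dots replaced by underscores plus type-specific suffixes, which is why it matches on name prefix. It also assumes sample lines end with the value (no trailing timestamp).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical acceptance helper for Prometheus text-format scrapes. */
final class ScrapeCheck {

    /**
     * Returns the dotted metric names that have no sample with a value above zero.
     * Assumes "a.b.c" is exposed with the prefix "a_b_c" and that each sample line
     * ends with its value.
     */
    static List<String> missingOrZero(String scrape, List<String> dottedNames) {
        List<String> missing = new ArrayList<>();
        String[] lines = scrape.split("\n");
        for (String dotted : dottedNames) {
            String prefix = dotted.replace('.', '_');
            boolean nonZero = false;
            for (String raw : lines) {
                String line = raw.strip();
                // Skip blanks, "# HELP" / "# TYPE" comments, and unrelated metrics.
                if (line.isEmpty() || line.startsWith("#") || !line.startsWith(prefix)) {
                    continue;
                }
                String value = line.substring(line.lastIndexOf(' ') + 1);
                try {
                    if (Double.parseDouble(value) > 0) {
                        nonZero = true;
                        break;
                    }
                } catch (NumberFormatException ignored) {
                    // Unparseable value (for example "+Inf"): treat as not satisfying the check.
                }
            }
            if (!nonZero) {
                missing.add(dotted);
            }
        }
        return missing;
    }
}
```

An empty result means every listed metric reported at least one positive sample. Counters such as hensu.node.errors and hensu.workflow.backtracks.total only move when a failure or backtrack actually occurs, so the sample workflow likely needs a step that provokes each.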

Metadata

Labels

  • area: native-image (GraalVM compilation, reflection metadata, and SubstrateVM issues)
  • area: server (Quarkus backend, REST endpoints, and SSE communication)
  • good first issue (Extra attention is needed)
  • type: feature (New feature or request)
