README.md: 16 additions & 44 deletions
@@ -11,9 +11,9 @@
 📄 arXiv Coming Soon
 </p>
 
-Bayesian-Agent is a Bayesian self-evolving layer for turning agent failures into reusable, evidence-weighted Skills and SOPs across agent frameworks and execution harnesses.
+Bayesian-Agent is a Bayesian self-evolving layer for turning verified agent trajectories into reusable, evidence-weighted Skills and SOPs across agent frameworks and execution harnesses.
 
-It is designed to stand out from monolithic agent frameworks in three ways:
+It supports three usage patterns:
 
 - **Run from scratch**: start with no prior traces and evolve Skills during full benchmark or production runs.
 - **Repair incrementally**: attach to an existing agent, read its failed trajectories, and rerun only the tasks that need repair.
@@ -26,29 +26,23 @@ It is designed to stand out from monolithic agent frameworks in three ways:
 - **2026-06-05:** Added full-sample native-harness results for SOP-Bench, Lifelong AgentBench, and RealFin-Bench with `deepseek-v4-flash` and `deepseek-v4-pro`; see [Experimental Results](#-experimental-results).
 - **2026-06-05:** Added the first-party Bayesian-Agent native harness. It runs its own LLM loop, workspace tools, three-layer memory, and trajectory capture; GenericAgent, mini-swe-agent, and Claude Code remain optional compatibility backends. See the [Native Harness design note](docs/native-harness.md).
 - **2026-05-31:** Added the Bayesian Evidence Model as the default Skill belief backend, with a categorical likelihood implementation and a legacy Beta-Bernoulli backend for ablations.
-- **2026-05-09:** Released Bayesian-Agent v0.4 as a standalone cross-harness Bayesian Skill Evolution package with schemas, CLI utilities, and experiment artifacts.
+- **2026-05-09:** Released Bayesian-Agent as a standalone cross-harness Bayesian Skill Evolution package with schemas, CLI utilities, and experiment artifacts.
 - **2026-05-09:** Added the optional GenericAgent adapter boundary without copying or vendoring GenericAgent.
 - **2026-05-09:** Published bilingual project documentation and the Bayesian-Agent framework diagram.
 
 ## 🌟 Overview
 
-Agent engineering is moving through three layers:
+Prompt Engineering improves task instructions. Context Engineering controls what evidence the model sees at inference time. Harness Engineering puts the model inside an observable, executable, recoverable system with tools, files, tests, memory, logs, and failure recovery.
-2. **Context Engineering**: decide what evidence the model can see at inference time.
-3. **Harness Engineering**: put the model inside an observable, executable, recoverable system.
-
-Prompting can improve one answer. Context can improve one decision. Harness Engineering is what lets an agent work across tools, files, tests, memory, logs, and failure recovery.
-
-In that setting, **Skills** and **SOPs** become first-class engineering assets. A good Skill is not just a longer prompt. It is compressed operational knowledge:
+In this setting, **Skills** and **SOPs** become first-class engineering assets. A good Skill is compressed operational knowledge:
 
 - what to inspect first
 - which tools to use
 - how to verify progress
 - which failure modes to avoid
 - when to stop, retry, or rewrite the procedure
 
-Bayesian-Agent asks a simple question: if Skills are hypotheses about how to solve tasks, why should they evolve by anecdote instead of evidence? The answer is a framework-agnostic evolution layer that can bootstrap Skills from scratch, repair existing agents incrementally, and move across harnesses as long as they emit verified trajectories.
+Bayesian-Agent treats Skills as hypotheses about how to solve tasks. Verified trajectories become evidence for updating, ranking, rewriting, compressing, or retiring those Skills. The same evolution layer can bootstrap Skills from scratch, repair existing agents incrementally, and move across harnesses that emit compatible trajectories.
@@ -93,7 +87,7 @@ The surface behavior of Bayesian-Agent may look like failure-driven Skill repair
 
 Agent runs are expensive: tokens are expensive, latency is high, benchmark cases are limited, and real production failures are even rarer. When samples are scarce, each sample is costly, and we cannot wait for large-sample statistics to stabilize, Bayesian modeling lets Bayesian-Agent combine prior belief, uncertainty, and new verified evidence into more stable decisions.
 
-This is why Bayesian-Agent is especially useful for sample-scarce, cost-sensitive, online Skill/SOP evolution. Read the full explanation in [Why Bayesian for Skill Evolution](docs/articles/why-bayesian-for-skill-evolution.md).
+Bayesian-Agent is most useful for sample-scarce, cost-sensitive, online Skill/SOP evolution. Read the full explanation in [Why Bayesian for Skill Evolution](docs/articles/why-bayesian-for-skill-evolution.md).
 
 ### What "Bayesian" Means in v0.5
 
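The paragraph in the hunk above says Bayesian modeling combines prior belief, uncertainty, and new verified evidence when samples are scarce. As a rough illustration only, the sketch below shows a plain Beta-Bernoulli update, the textbook model that the changelog's "legacy Beta-Bernoulli backend" is named after. The class `SkillBelief`, its fields, and the uniform Beta(1, 1) prior are assumptions made for this example and are not the package's API; the default categorical backend is not shown.

```python
from dataclasses import dataclass


@dataclass
class SkillBelief:
    """Hypothetical Beta-Bernoulli belief over one Skill's success rate."""

    alpha: float = 1.0  # pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0   # pseudo-count of failures (assumed uniform prior)

    def update(self, success: bool) -> None:
        # Conjugate update: each verified trajectory adds one pseudo-count.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean of the success probability.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Posterior variance of the success probability.
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total * total * (total + 1.0))


belief = SkillBelief()
for outcome in [True, True, False, True]:  # four verified runs of one Skill
    belief.update(outcome)

print(f"mean={belief.mean:.3f} variance={belief.variance:.4f}")
```

With only four trajectories the posterior mean stays pulled toward the prior and the variance remains visible, which is the behavior the paragraph relies on when it talks about stable decisions under scarce evidence.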
@@ -299,34 +293,6 @@ print(skill_context)
 
 `SkillContextBuilder` renders a compact posterior audit view. The built-in SOP/Lifelong runners convert recurring posterior-backed failure modes into executable patches and guardrails before adding them to model prompts.
 
-## 🔁 Three Operating Patterns
-
-### 🌱 Full Self-Evolving Mode
-
-Bayesian-Agent starts from scratch, runs benchmark tasks, collects verified evidence, and evolves Skills during the run.
-
-This mode tests whether Bayesian Skill Evolution can improve an agent without relying on prior traces.
-
-### 🛠️ Incremental Repair Mode
-
-Bayesian-Agent can also attach to an existing agent. The base agent runs first. Bayesian-Agent reads its success and failure traces, updates posterior Skill beliefs, then reruns only the failed tasks.
-This is the recommended production path because it improves an existing agent without retraining the model or replacing the original harness.
-
-### 🔌 Cross-Harness Adaptation Mode
-
-Bayesian-Agent is not tied to a single agent runtime. Any agent framework can become a backend if it emits the common trajectory schema and accepts model-facing Skill/SOP text through an adapter.
-
-```text
-Any Agent Harness -> Trajectory Schema -> Bayesian Skill Registry -> Adapter -> Next Harness Run
-```
-
-This makes Bayesian-Agent a portable Skill/SOP evolution layer rather than another closed agent framework.
-
 ## 📊 Experimental Results
 
 Bayesian-Agent now has its own native harness. The results below are full-sample runs with no `--limit`: SOP-Bench and Lifelong AgentBench use 20 tasks each, and RealFin-Bench uses 40 tasks.
@@ -357,7 +323,7 @@ Bayesian-Agent now has its own native harness. The results below are full-sample
-Compared with the earlier GA-backed artifacts, BA native improves the full RealFin final score on `deepseek-v4-pro` from 68% to 77.5%, but it spends more tokens because the first-party harness deliberately keeps the runtime minimal and lets the model inspect cached market data directly. On SOP/Lifelong, BA native reaches 95-100% full-sample accuracy while using less token budget than the historical GA-backed full runs.
+Compared with earlier GA-backed artifacts, BA native improves the full RealFin final score on `deepseek-v4-pro` from 68% to 77.5%. It spends more tokens because the first-party harness keeps the runtime minimal and lets the model inspect cached market data directly. On SOP/Lifelong, BA native reaches 95-100% full-sample accuracy with lower token use than the historical GA-backed full runs.
 
 ### 🧱 Published GA Validation: GenericAgent + deepseek-v4-flash
 
@@ -399,7 +365,7 @@ The earlier RealFin validation used GenericAgent as the execution backend with `
-The result shows that Bayesian-Agent can work as a plug-in repair layer: it can take an existing agent below 100% accuracy and improve it with a small amount of incremental inference. This is the practical advantage over one-off benchmark agents: Bayesian-Agent can sit beside a harness, learn from its failures, and improve it without replacing it.
+In incremental mode, Bayesian-Agent uses an existing agent's failed trajectories, updates Skill beliefs, and reruns only the failed tasks. The repair-only token columns report the additional inference cost.
 
 Experiment artifacts are stored under [`artifacts/`](artifacts/) and [`results/`](results/), and the method note is in [`docs/method.md`](docs/method.md). The native harness design note is in [`docs/native-harness.md`](docs/native-harness.md).
 
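The added line in the hunk above describes incremental mode as rerunning only the failed tasks and reporting the extra inference cost separately. A minimal sketch of that bookkeeping might look like the following. `Trajectory`, `repair_failed`, and the `rerun` callback are hypothetical names invented for this example, not the package's schema or CLI, and the Skill belief update that happens between the baseline and the rerun is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """Hypothetical record of one verified task run."""

    task_id: str
    success: bool
    tokens: int


def repair_failed(
    baseline: List[Trajectory],
    rerun: Callable[[str], Trajectory],
) -> Dict[str, float]:
    """Rerun only failed baseline tasks and report repair-only token cost."""
    if not baseline:
        return {"accuracy": 0.0, "repair_tokens": 0.0}

    final = {t.task_id: t for t in baseline}
    repair_tokens = 0
    for traj in baseline:
        if traj.success:
            continue  # passing tasks are kept as-is and cost nothing extra
        repaired = rerun(traj.task_id)
        repair_tokens += repaired.tokens  # counted even if the repair fails
        if repaired.success:
            final[traj.task_id] = repaired

    accuracy = sum(t.success for t in final.values()) / len(final)
    return {"accuracy": accuracy, "repair_tokens": float(repair_tokens)}


# Toy usage: the stand-in rerun callback repairs task "b" but not task "c".
baseline_runs = [
    Trajectory("a", True, 100),
    Trajectory("b", False, 120),
    Trajectory("c", False, 90),
]
report = repair_failed(baseline_runs, lambda tid: Trajectory(tid, tid == "b", 50))
print(report)
```

Keeping `repair_tokens` separate from the baseline token count mirrors the "repair-only token columns" wording: the number reflects only the additional inference spent on failed tasks.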
@@ -420,7 +386,7 @@ The script runs three phases by default: selected-harness baseline, Bayesian ful
 
 ## 🔌 Native Harness and Cross-Harness Adaptation
 
-The first prototype was validated inside GenericAgent, but Bayesian-Agent now has its own execution harness. It is not a GenericAgent fork and not just a GenericAgent add-on.
+Bayesian-Agent ships a native harness plus adapter boundaries for external agent runtimes. The first prototype was validated inside GenericAgent; v0.5 keeps GenericAgent as an optional compatibility backend.
 
 The open-source structure is:
 
@@ -473,6 +439,12 @@ tests/ # Standard-library unittest suite
 - [ ] Add adapters for more agent harnesses after the current boundaries stabilize.
 - [ ] Move beyond the current per-Skill evidence backend toward richer Bayesian reasoning, including Skill hypothesis inference, Bayesian Networks for context/failure structure, uncertainty-aware Skill selection, Bayesian decision policies, and online adaptation.