Turn any AI coding agent into a self-learning knowledge engine.
A modular, instruction-driven framework that teaches AI agents (Claude, GPT, Cursor, Codex, Gemini) to build, maintain, and evolve a structured personal knowledge base, powered by local-first NLP and automatic context indexing.
AI Knowledge Engine is a set of modular Markdown instructions that any AI coding agent can read and execute to:
- Index your codebase: pack your project into a single AI-readable context file (5 min setup)
- Build a knowledge base: create a full Raw-First Knowledge Pipeline with NLP enrichment, provenance tracking, self-learning, and automated maintenance
No SaaS. No API keys. No cloud. Everything runs locally on your machine.
Indexer-agnostic: Ships with Repomix support out of the box, but the architecture is designed to work with any codebase-to-context tool.
| Mode | What you get | Setup time |
|---|---|---|
| Lite (`quick-start/`) | AI-optimized codebase index with auto-update on git commit | ~5 minutes |
| Full (`knowledge-base/`) | Personal knowledge engine with NLP, self-learning loop, AI review queue, health checks, and smart scheduling | ~30 minutes |
Lite mode (`quick-start/`):

```bash
# 1. Install the indexer
npm install -g repomix
# 2. Copy quick-start/ into your project
cp -r quick-start/ /path/to/your-project/docs/ai-init/
# 3. Tell your AI agent:
#    "Read docs/ai-init/INIT_GUIDE.md and set up context indexing for this project"
```

Your AI agent will analyze the project structure, configure the indexer, set up git hooks, and generate the first context snapshot.
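To make the "set up git hooks" step more concrete, here is a minimal sketch of what installing an auto-update hook could look like. It is not the hook that `INIT_GUIDE.md` actually generates. The choice of a `post-commit` hook, the helper name `install_post_commit_hook`, and the assumption that a bare `repomix` call picks up the project's config file are all illustrative.

```python
import stat
from pathlib import Path

# Assumption: running `repomix` with no arguments in the repository root
# regenerates the context file using the project's own configuration.
HOOK_BODY = """#!/bin/sh
# Regenerate the AI context snapshot after each commit (illustrative only).
repomix || echo 'repomix failed; context snapshot not updated' >&2
"""


def install_post_commit_hook(repo_root: Path) -> Path:
    """Write an executable post-commit hook into repo_root/.git/hooks."""
    hooks_dir = repo_root / ".git" / "hooks"
    if not hooks_dir.is_dir():
        raise FileNotFoundError(f"{hooks_dir} not found; is this a git repository?")
    hook_path = hooks_dir / "post-commit"
    hook_path.write_text(HOOK_BODY, encoding="utf-8")
    # Git only runs hooks that are executable, so add the execute bits.
    mode = hook_path.stat().st_mode
    hook_path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return hook_path
```

Calling `install_post_commit_hook(Path("."))` from the project root would write `.git/hooks/post-commit`. The hook only prints a warning if the indexer fails, so a broken indexer never blocks a commit.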
Full mode (`knowledge-base/`):

```bash
# 1. Copy this repo's knowledge-base/ into your project as `setup/`
cp -r knowledge-base/ /path/to/your-project/setup/
# 2. Open the project in an IDE with an AI agent
cd /path/to/your-project
# 3. In the chat, send EXACTLY this:
#
#    "Read setup/README.md and setup/00_OVERVIEW.md, then deploy a
#    knowledge base for [your role] inside ./knowledge-base/.
#    When kb_doctor passes, run setup/shell/finalize.sh to flatten
#    the base into the project root."
```

The agent will:

- Ask clarifying questions about your role (or invent a custom one if no built-in fits)
- Build the base inside `./knowledge-base/` while the original `setup/` stays put
- Parameterize `kb.config.yml`, `AGENTS.md`, `KNOWLEDGE_STRUCTURE.md`
- Generate `DATA_PLACEMENT_EXAMPLES.md` (deterministic, no tokens) via `kb_populate.py`
- Generate `START_HERE.md`, your first read after deployment
- Run `kb_doctor.py` to confirm everything is wired
- Run `bash setup/shell/finalize.sh`, which promotes the contents of `knowledge-base/` to the project root and removes both `setup/` and `knowledge-base/`
After deployment your project root looks like this:
```
your-project/
├── START_HERE.md                      ← read this first
├── AGENTS.md                          ← agent instructions
├── kb.config.yml                      ← config
├── DATA_PLACEMENT_EXAMPLES.md         ← what to drop where (role-specific)
├── reindex.command                    ← macOS double-click
├── watcher-start.command              ← macOS double-click
├── watcher-stop.command               ← macOS double-click
├── reindex.bat, watcher-start.bat     ← Windows double-click
├── shell/                             ← Linux/CLI: watcher.sh, reindex.sh, lint.sh, doctor.sh
├── scripts/                           ← Python pipeline
├── templates/, examples/
├── raw/, processed/, knowledge/, assets/, assets-index/, review/, interactions/
└── .repomix/                          ← AI-ready output
```
⚠️ Critical: at the start of every new chat session, send the line: "Read AGENTS.md and use it as the primary instruction for everything that follows." Without this, the agent has no idea your knowledge base exists. `START_HERE.md` reminds you of this.
- One-command setup: the AI agent handles the entire configuration
- Auto-update: Git hooks regenerate context on every commit
- Stack-aware: pre-configured patterns for React, Rust, Python, Go, Node.js, and more
- Security scanning: detects leaked secrets before indexing
- Token budget control: Tree-sitter compression reduces token count by 50-70%
- Profile support: separate context files per subsystem (backend, frontend, infra)
- Raw-First Pipeline: drop PDFs, DOCX, PPTX into `raw/` → auto-convert to Markdown → NLP enrichment → clean knowledge
- Self-Learning Loop: `!save` sessions, `!reflect` for higher-level insights, `!audit` for comprehensive review
- Cross-Linked Knowledge: `[[wikilinks]]` + routing tables for scalable navigation across hundreds of pages
- NLP Enrichment: Named Entity Recognition, keyword extraction, entity resolution (spaCy + KeyBERT), zero tokens, pure CPU
- Provenance Tracking: every fact traced to its original source with hash verification and span-level citations
- Surprise Filter: anti-duplication, so only genuinely new information enters the knowledge base
- Health Checks: Python-based lint for stale pages, orphans, broken links, and contradictions
- Smart Scheduling: auto-reflection when the importance threshold is reached; skips when idle to save tokens
- Privacy-by-Default: raw data, reviews, and interaction logs are never indexed
- Fully Portable: pure Markdown files, no databases, no servers, works on any machine with `rsync`
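As an illustration of the health-check idea (broken links and orphan pages), the following is a minimal sketch of a wikilink lint. It is not the project's `kb_lint.py`. It assumes that a `[[target]]` link resolves to a Markdown file whose name without the extension is `target`, and the function name `lint_wikilinks` is hypothetical.

```python
import re
from pathlib import Path

# Captures the link target, stopping before an alias (|) or heading anchor (#).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def lint_wikilinks(knowledge_dir: Path) -> dict:
    """Return broken links and orphan pages found under knowledge_dir."""
    # Assumption: page identity is the file stem; same-named files would collide.
    pages = {p.stem: p for p in knowledge_dir.rglob("*.md")}
    linked = set()
    broken = []
    for name, path in sorted(pages.items()):
        for target in WIKILINK.findall(path.read_text(encoding="utf-8")):
            target = target.strip()
            if target in pages:
                if target != name:  # a self-link does not rescue an orphan
                    linked.add(target)
            else:
                broken.append((name, target))
    # Orphans are pages that no other page links to.
    orphans = sorted(set(pages) - linked)
    return {"broken": broken, "orphans": orphans}
```

A check like this needs no AI tokens, which is the point of keeping first-level lint in plain Python.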
```
                  ┌─────────────┐
User ───────────▶ │    raw/     │  ← PDFs, DOCX, notes, chats, screenshots
                  └──────┬──────┘
                         │  kb_ingest.py (Python + NLP)
                         ▼
                  ┌─────────────┐
                  │ processed/  │  ← Markdown + NLP metadata (0 tokens)
                  └──────┬──────┘
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
       ┌──────────┐ ┌──────────┐ ┌──────────┐
       │knowledge/│ │ review/  │ │ assets/  │
       │ (clean)  │ │(complex) │ │ (binary) │
       └────┬─────┘ └──────────┘ └──────────┘
            │
            ▼  context indexer
     ┌──────────────┐
     │  output.xml  │  ← AI-ready context snapshot
     └──────────────┘
```
Token consumption depends on the operating mode (`mode` in `kb.config.yml`):
| Tier | Operations | default | super |
|---|---|---|---|
| Python (free) | NLP, lint L1, conversion | 0 tok | 0 tok |
| Light AI | Importance scoring | ~500 | ~1-2K |
| Mode-switched | Surprise filter, annotations, entity resolution | 0 tok (Python) | ~3-9K tok (AI) |
| Heavy AI | Reflection, deep review, writeback | ~15-100K | ~15-100K |
| Mode | Daily (active) | Weekly |
|---|---|---|
| `default` | ~3-4K tokens | ~20-30K |
| `super` | ~50-200K+ tokens | ~500K-1.5M |
The full mode creates a rich, opinionated folder hierarchy at the project root (no nested knowledge-base/):
```
your-project/
├── START_HERE.md                  # Read this first
├── AGENTS.md                      # AI agent instructions (auto-generated)
├── KNOWLEDGE_STRUCTURE.md         # This map
├── DATA_PLACEMENT_EXAMPLES.md     # "Got X → put it here" (role-specific)
├── kb.config.yml                  # Config: role, entities, mode
├── repomix.config.json            # Indexer config
├── requirements.txt
│
│   # Double-click launchers (macOS / Windows) live at the root:
├── reindex.command, reindex.bat               # Manual reindex
├── watcher-start.command, watcher-start.bat   # Auto pipeline
├── watcher-stop.command                       # macOS daemon stop
│
├── shell/                         # Linux / CLI: *.sh wrappers
│   ├── watcher.sh, reindex.sh
│   ├── lint.sh, doctor.sh
│   └── (launchers were promoted to root by finalize.sh)
│
├── scripts/                       # Python pipeline (do not modify lightly)
│   ├── kb_ingest.py               # Raw → processed → knowledge
│   ├── kb_lint.py                 # Health check (Level 1)
│   ├── kb_reflect.py              # Reflection trigger logic
│   ├── kb_watch.py                # File watcher daemon
│   ├── kb_nlp_batch.py            # Batch NLP re-enrichment
│   ├── kb_populate.py             # Generate DATA_PLACEMENT_EXAMPLES.md
│   ├── kb_doctor.py               # Smoke test
│   └── kb_common.py               # Shared utilities
├── templates/                     # Kept for re-runs (kb_populate, kb_upgrade)
├── examples/                      # Role YAMLs
│
├── raw/                           # 🚫 Raw materials (NOT indexed)
│   ├── documents/unsorted/
│   ├── reference/unsorted/
│   ├── work/unsorted/
│   ├── chats/unsorted/
│   ├── media/unsorted/
│   ├── personal-context/unsorted/
│   └── unsorted/
│
├── processed/                     # 🚫 Converted artifacts (NOT indexed)
├── assets/                        # 🚫 Binary originals (NOT indexed)
│
├── knowledge/                     # ✅ Clean knowledge: INDEXED
│   ├── profile/, principles/, voice/
│   ├── domain/, projects/, decisions/
│   ├── playbooks/, insights/, opinions/
│   ├── timelines/, routing/
│   ├── open-questions/
│   └── _archive/
│
├── assets-index/                  # ✅ Markdown descriptions of assets: INDEXED
├── review/                        # 🚫 Review queues (NOT indexed)
│   ├── needs-classification/
│   ├── needs-ai-decision/
│   ├── needs-redaction/
│   └── excluded-sensitive/
│
├── interactions/                  # 🚫 Session logs (NOT indexed directly)
└── .repomix/output.xml            # 🚫 Generated index (regenerated locally)
```
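The indexed versus not-indexed split in this layout can be expressed as a small lookup. The sketch below only restates the folder annotations from the tree. The function name `classify` is hypothetical, and how the real indexer config treats root-level files is not stated here, so those are reported as `"unknown"`.

```python
from pathlib import PurePosixPath

# Top-level folders as annotated in the tree: two are indexed, the rest are not.
INDEXED_ROOTS = {"knowledge", "assets-index"}
PRIVATE_ROOTS = {"raw", "processed", "assets", "review", "interactions", ".repomix"}


def classify(relative_path: str) -> str:
    """Return 'indexed', 'private', or 'unknown' for a project-relative path."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return "unknown"
    root = parts[0]
    if root in INDEXED_ROOTS:
        return "indexed"
    if root in PRIVATE_ROOTS:
        return "private"
    # Root-level files (AGENTS.md, kb.config.yml, ...) are not covered by the tree notes.
    return "unknown"
```

For example, `classify("knowledge/domain/rust.md")` returns `"indexed"`, while `classify("raw/chats/unsorted/log.txt")` returns `"private"`.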
The watcher monitors raw/<sub>/unsorted/ and runs the ingest pipeline whenever you drop a file there.
macOS:

| Action | File |
|---|---|
| Start watcher | Double-click `watcher-start.command` |
| Stop watcher | Press Ctrl+C in the opened Terminal window or, if you started in daemon mode, double-click `watcher-stop.command` |
| Manual reindex | Double-click `reindex.command` |

Linux / CLI:

```bash
./shell/watcher.sh            # foreground, Ctrl+C to stop
./shell/watcher.sh --daemon   # background
./shell/watcher.sh --status
./shell/watcher.sh --stop
./shell/reindex.sh            # one-shot reindex
```

Windows:

| Action | File |
|---|---|
| Start watcher | Double-click `watcher-start.bat` |
| Stop watcher | Close the cmd window or press Ctrl+C |
| Manual reindex | Double-click `reindex.bat` |
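Conceptually, the watcher is a loop that notices new files in the `unsorted/` folders and hands each one to the ingest step. The sketch below shows that idea with simple polling from the standard library. It is not the project's `kb_watch.py`, which may use a different mechanism, and the names `find_new_files` and `watch` plus the `handle` callback are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional, Set


def find_new_files(raw_dir: Path, seen: Set[Path]) -> List[Path]:
    """Return files inside any unsorted/ folder under raw_dir not seen before."""
    new_files = []
    for path in sorted(raw_dir.rglob("*")):
        parent_parts = path.relative_to(raw_dir).parts[:-1]
        if path.is_file() and "unsorted" in parent_parts and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def watch(raw_dir: Path, handle: Callable[[Path], None],
          interval: float = 2.0, max_cycles: Optional[int] = None) -> None:
    """Poll raw_dir and call handle(path) once for every newly dropped file."""
    seen: Set[Path] = set()
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for path in find_new_files(raw_dir, seen):
            handle(path)  # in the real pipeline this would trigger ingest
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval)
```

Something like `watch(Path("raw"), print)` would print each dropped file once. `max_cycles` exists only so the loop can be bounded in tests.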
Commands you tell your AI agent in the IDE chat.
🚨 Before any command works in a new chat, first send: "Read AGENTS.md and use it as the primary instruction for everything that follows."
| Command | What it does | Cost | When to use |
|---|---|---|---|
| `!save` | Save session summary with key decisions and insights | ~2K tokens | After productive sessions (45+ min) |
| `!reflect` | Synthesize higher-level insights from accumulated facts | ~15K tokens | Auto-triggered or on demand |
| `!audit` | Full AI review: contradictions, gaps, merge candidates | ~50-100K tokens | Every 2-4 weeks |
| `!review` | Process the `review/` queues: turn flagged materials into `knowledge/` pages, redact sensitive content, ask questions when input is needed | ~5-30K tokens | When `review/needs-ai-decision/` accumulates after ingest |
| `!populate` | Re-generate `DATA_PLACEMENT_EXAMPLES.md` (run after editing your role YAML) | ~50 tokens | After tweaking `examples/<role>.yml` |
| `!super` | Toggle operating mode: default ↔ super | 0 tokens | When you need maximum learning speed |
| `!super on/off` | Explicitly enable/disable super mode | 0 tokens | See Operating Modes |
| `!super status` | Show current operating mode | 0 tokens | Quick check |
The system supports two operating modes, controlled by the `!super` command:
| Mode | Paradigm | Token Cost | Best For |
|---|---|---|---|
| default | Python-first, throttled | ~3-4K/day | Limited token budgets, daily use |
| super | AI-first, on-demand | ~50-200K+/day | Unlimited plans, intensive knowledge building |
Default mode uses Python-first processing (NLP, heuristic filters) and throttled AI schedules. AI is called only for importance scoring and large document surprise checks.
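To give a feel for what a Python-side heuristic filter can look like, here is a minimal sketch of an overlap-based novelty check. It is a stand-in, not the project's actual filter: it uses plain word overlap rather than the spaCy and KeyBERT output the pipeline relies on, and the names `novelty` and `is_surprising` and the 0.3 threshold are assumptions made for illustration.

```python
import re
from typing import Iterable, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def novelty(candidate: str, known_texts: Iterable[str]) -> float:
    """Share of the candidate's words not covered by its closest known text.

    1.0 means entirely new vocabulary, 0.0 means fully covered.
    """
    cand = tokens(candidate)
    if not cand:
        return 0.0  # nothing to learn from an empty text
    best_overlap = max(
        (len(cand & tokens(known)) / len(cand) for known in known_texts),
        default=0.0,
    )
    return 1.0 - best_overlap


def is_surprising(candidate: str, known_texts: Iterable[str],
                  threshold: float = 0.3) -> bool:
    """Hypothetical gate: admit the candidate only if it is novel enough."""
    return novelty(candidate, known_texts) >= threshold
```

A check of this kind costs no tokens, which matches the Python-first intent of default mode. The trade-off is that it only sees shared words, not shared meaning.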
Super mode replaces Python heuristics with full AI reasoning for every operation:
- Semantic surprise detection: AI evaluates every ingest for genuinely new information (+40% accuracy vs Python NLP overlap)
- Intelligent annotations: AI generates meaningful cross-references with suggested edits (+60% usefulness vs template annotations)
- Cross-language entity resolution: AI understands synonyms and multilingual variants (+30% coverage)
- On-demand reflection: triggers after every significant ingest (importance ≥5) instead of on a weekly schedule
- Daily AI audit: Lint Level 2 runs automatically during consolidation
- Auto review processing: `review/needs-ai-decision/` is processed without waiting for `!audit`
⚠️ Warning: Super mode can consume your entire daily token budget in a single active session. Use it only with unlimited or high-limit AI plans.
The system tracks an importance score for each ingested item. When the cumulative score exceeds a threshold (default: 25, super: 5), reflection runs automatically. If nothing changed, no tokens are spent.
```
Days without reflection:  1  2  3  4  5  6  7  8  9
Changes?                  -  -  -  -  -  -  -  -  ✓
                                                  ↑
                                                  Trigger! (>7 days + changes exist)
```
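The scheduling rule described here can be sketched as a small decision function. This is an interpretation, not the contents of `kb_reflect.py`: it combines the cumulative-importance thresholds (25 and 5) with the ">7 days + changes exist" rule from the diagram, and it treats "changes exist" as a non-zero cumulative score, which is an assumption. The names `ReflectionState` and `should_reflect` are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: 25 in default mode, 5 in super mode.
THRESHOLDS = {"default": 25, "super": 5}
MAX_IDLE_DAYS = 7  # from the diagram: trigger after more than 7 days if changes exist


@dataclass
class ReflectionState:
    cumulative_importance: int = 0  # sum of importance scores since last reflection
    days_since_reflection: int = 0


def should_reflect(state: ReflectionState, mode: str = "default") -> bool:
    """Decide whether to run reflection now."""
    if state.cumulative_importance == 0:
        return False  # nothing changed, so no tokens are spent
    if state.cumulative_importance > THRESHOLDS[mode]:
        return True  # enough important material has accumulated
    # Below the threshold: reflect anyway once changes have waited long enough.
    return state.days_since_reflection > MAX_IDLE_DAYS
```

With these rules, a quiet knowledge base never triggers reflection, while a slow trickle of minor changes is still picked up after a week.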
The knowledge base system is built from modular instruction files that the AI agent reads sequentially:
| # | Module | Purpose |
|---|---|---|
| 00 | Overview | Deployment map: what to read, what to copy, in what order (read first) |
| 01 | Prerequisites | Environment check: Node.js, Python, Git, indexer |
| 02 | Init | Role clarification, entity selection, folder creation |
| 03 | Pipeline | Python ingest script: conversion + NLP + source hashing |
| 04 | Review | AI review workflow for complex/ambiguous materials |
| 05 | Index | Context indexing, [[wikilinks]], routing tables |
| 06 | Agents Template | AGENTS.md template with token budget |
| 07 | Interaction Loop | Self-learning + session capture + query writeback |
| 08 | Portable | Portability + Dynamic Context Enrichment |
| 09 | Lint | Health checks: Level 1 (Python) + Level 2 (AI) + --metrics |
| 10 | Log | Append-only operation chronology |
| 11 | Provenance | Source hash, span citations, regression tests |
| 12 | NLP Preprocess | NER + keyword extraction + entity resolution |
| 13 | Autorun | File watcher, git hooks, smart scheduling |
| 14 | Initial Population | Generate role-specific DATA_PLACEMENT_EXAMPLES.md |
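As an illustration of what module 11 (source hash plus span citations) implies, here is a minimal sketch of recording and verifying a citation. It is not the project's implementation. SHA-256, byte offsets as spans, and the names `Citation`, `cite`, and `verify` are assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_path: str    # where the raw file lives, e.g. under raw/
    source_sha256: str  # hash of the raw file at ingest time
    start: int          # byte offset where the cited span begins
    end: int            # byte offset just past the cited span


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def cite(source_path: str, source_bytes: bytes, start: int, end: int) -> Citation:
    """Create a span-level citation pinned to the current source content."""
    if not 0 <= start < end <= len(source_bytes):
        raise ValueError("span is outside the source")
    return Citation(source_path, sha256_hex(source_bytes), start, end)


def verify(citation: Citation, source_bytes: bytes, expected_text: str) -> bool:
    """Check that the source is unchanged and the span still holds the quoted text."""
    if sha256_hex(source_bytes) != citation.source_sha256:
        return False  # the source was modified after the fact was recorded
    span = source_bytes[citation.start:citation.end]
    return span == expected_text.encode("utf-8")
```

A regression test in this style would re-run `verify` for every stored citation and flag any fact whose source no longer matches.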
Pre-configured examples in knowledge-base/examples/:
| Template | Role | Highlights |
|---|---|---|
| `programmer-senior.yml` | Senior Software Engineer | Architecture, debugging, tech stack, code principles |
| `marketing-director.yml` | Marketing Director | Strategy, brand, campaigns, audience analysis |
| `creative-hybrid.yml` | Creative Hybrid | Code + music production + indie gamedev |
| `product-manager.yml` | Product Manager | Prioritization, metrics, user research, PRDs |
| `researcher.yml` | Researcher / Analyst | Literature graph, hypotheses, methodology |
| `founder.yml` | Startup Founder | Investors, hiring, product, decision logs |
| `content-creator.yml` | Content Creator | Voice fingerprinting, audience, monetization |
| `fiction-writer.yml` | Fiction Writer | Craft theory, voice training from influences, draft critique |
Don't see your role? Tell the AI agent your profession β it will generate a custom configuration with relevant entities, knowledge paths, and example workflows.
Tested and designed for:
| Agent | Status | Notes |
|---|---|---|
| Claude (Anthropic) | ✅ Fully supported | Cursor, API, Claude Desktop |
| GPT-4 / GPT-4o | ✅ Fully supported | Cursor, Copilot, ChatGPT |
| Codex CLI | ✅ Fully supported | OpenAI Codex |
| Gemini | ✅ Fully supported | JetBrains AI, Google AI Studio |
| Any Markdown-capable agent | ✅ Compatible | Must read `.md` and run shell commands |
| Component | Minimum | Required for |
|---|---|---|
| Node.js | 20.0+ | Context indexer (Repomix) |
| Python | 3.11+ | Ingest pipeline, NLP, lint |
| Git | any | Hooks, history tracking |
| IDE with AI | required | Agent interaction |
```bash
# Ubuntu / Debian
sudo apt install -y pandoc poppler-utils tesseract-ocr
# macOS
brew install pandoc poppler tesseract
```

Contributions are welcome! Here's how you can help:
- Translations: translate instruction modules to other languages
- Role Templates: add `examples/*.yml` for new professions
- Pipeline Scripts: improve Python ingest, NLP, and lint scripts
- Documentation: clarify instructions, add diagrams, fix typos
- Testing: try with different AI agents and report compatibility
Please open an issue before starting major work to discuss the approach.
MIT. Free for personal and commercial use.
Built for humans who talk to AI.
If this project helps you build a better knowledge workflow, give it a star ⭐.