Merged
31 changes: 18 additions & 13 deletions .dockerignore
@@ -2,8 +2,10 @@
.git
.gitignore
.gitattributes
.gitleaks.toml
.pre-commit-config.yaml

# Python
# Python artifacts
__pycache__/
*.py[cod]
*$py.class
@@ -13,6 +15,9 @@ __pycache__/
venv/
ENV/
env/
build/
dist/
*.egg-info/

# Testing
.pytest_cache/
@@ -37,19 +42,15 @@ docs/
!README.md
tmp_docs/

# Databases
# Runtime data
*.db
*.sqlite
*.sqlite3
sync_state/
cron_state.db

# Environment
.env
.env.*

# Build artifacts
build/
dist/
*.egg-info/
# Environment and secrets
.env*

# Logs
*.log
@@ -58,7 +59,11 @@ dist/
.DS_Store
Thumbs.db

# Tests
# CI
.github

# Claude
.claude

# Tests (not needed in production image)
tests/
test_*.py
*_test.py
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -30,7 +30,7 @@ repos:
name: pip-audit
#ignore 2026-4539 as not fixed
#ignore 2026-25645 due to cooldown
entry: uv run --with pip-audit pip-audit --ignore-vuln CVE-2026-4539 --ignore-vuln CVE-2026-25645
entry: uv run --with pip-audit pip-audit --ignore-vuln CVE-2025-3000 --ignore-vuln PYSEC-2025-217 --ignore-vuln CVE-2026-1839 --ignore-vuln=GHSA-537c-gmf6-5ccf --ignore-vuln=CVE-2026-54283 --ignore-vuln=CVE-2026-54282
language: system
pass_filenames: false
files: (^pyproject\.toml$|^uv\.lock$)
76 changes: 52 additions & 24 deletions AGENTS.md
@@ -4,7 +4,7 @@ Instructions for AI coding agents working with Soliplex Agents.

## Project Overview

Document ingestion agents that load files from multiple sources (filesystem, WebDAV, GitHub, Gitea) into Soliplex Ingester for processing and indexing.
Document ingestion agents that collect files from multiple sources (filesystem, WebDAV, web, GitHub, Gitea) and write them to a local download directory, with an optional haiku-rag load step that indexes them into per-source LanceDB databases.

**Stack:** Python 3.13+, FastAPI, aiohttp, Typer CLI, Pydantic v2

@@ -35,28 +35,34 @@ si-agent scm run-incremental gitea myowner/myrepo
```text
src/soliplex/agents/
├── cli.py # Main Typer CLI entry point
├── client.py # Ingester API client (HTTP operations)
├── config.py # Pydantic settings
├── config.py # Pydantic settings + manifest models
├── local_state.py # Local sync state (content hashes, commit SHAs)
├── local_store.py # Writing documents + .meta.json sidecars to DOWNLOAD_DIR
├── retry.py # Retry helpers
├── common/
│ └── config.py # File validation utilities
├── fs/ # Filesystem agent
│ ├── cli.py # CLI commands
│ └── app.py # Business logic
├── fs/ # Filesystem agent (cli.py + app.py)
├── scm/ # Source control agent
│ ├── cli.py # CLI commands
│ ├── app.py # SCM orchestration
│ ├── base.py # BaseSCMProvider abstract class
│ ├── git_cli.py # Git CLI decorator (local clone mode)
│ ├── github/ # GitHub provider implementation
│ ├── gitea/ # Gitea provider implementation
│ └── lib/
│ ├── utils.py # Hashing utilities
│ └── templates/ # Jinja2 templates
├── webdav/ # WebDAV agent
│ ├── cli.py
│ └── app.py
├── webdav/ # WebDAV agent (cli.py + app.py + async_client.py)
├── web/ # Web agent (app.py)
├── manifest/ # Declarative multi-source runner
│ ├── cli.py # `manifest run` command
│ ├── runner.py # load_manifest / run_manifest dispatch
│ └── haiku_loader.py # haiku-rag batch load subprocess
└── server/ # FastAPI REST API
├── __init__.py # App setup, CORS, scheduler
├── __init__.py # App setup, CORS, scheduler, lifespan
├── auth.py # Authentication
├── locks.py # Per-manifest execution locks
├── haiku_queue.py # Global FIFO queue serializing haiku-rag loads
└── routes/ # API endpoints
```

@@ -85,11 +91,11 @@ Use `soliplex.agents` (dot notation):

```python
# Correct
from soliplex.agents.client import IngesterClient
from soliplex.agents.config import get_settings
from soliplex.agents.config import settings
from soliplex.agents.manifest import runner

# Incorrect
from soliplex_agents.client import IngesterClient
from soliplex_agents.config import settings
```

### Hashing Algorithms
@@ -118,14 +124,14 @@ uv run pytest
uv run pytest --cov-report=html

# Run specific test
uv run pytest tests/unit/test_client.py
uv run pytest tests/unit/test_manifest_runner.py
```

**Requirements:**
- 100% branch coverage for non-excluded code
- Unit tests in `tests/unit/`
- Functional tests in `tests/functional/` (skipped by default)
- Mock external services (Ingester API, GitHub, Gitea)
- Mock external services and subprocesses (GitHub, Gitea, `haiku-ingester`)

**Coverage Exclusions:**
- `*/cli.py` - CLI modules
@@ -138,7 +144,17 @@ uv run pytest tests/unit/test_client.py
### Required

```bash
ENDPOINT_URL=http://localhost:8000/api/v1 # Ingester API
DOWNLOAD_DIR=downloads # Where fetched documents are written
STATE_DIR=sync_state # Local sync state (one SQLite file per source)
```

### haiku-rag Loading

```bash
HAIKU_LOAD_ENABLED=true # Queue a haiku-rag load after each manifest run
LANCEDB_DIR=/var/lib/lancedb # Base dir for per-source <source>.lancedb
HAIKU_PATH=/etc/haiku # Base dir for haiku-rag config files
# HAIKU_LOAD_COMMAND, HAIKU_DEFAULT_CONFIG, HAIKU_LOAD_TIMEOUT, HAIKU_LOAD_CWD also available
```

### SCM Authentication
@@ -187,6 +203,8 @@ si-agent
│ ├── validate-config <path>
│ ├── check-status <path> <source>
│ └── run-inventory <path> <source>
├── manifest
│ └── run <path> [--json] [--load/--no-load] # Run manifest(s); optionally haiku-rag load
└── serve [--host] [--port] [--reload]
```

@@ -231,21 +249,30 @@ def get_provider(platform: str) -> BaseSCMProvider:

**Git CLI Decorator:** When `scm_use_git_cli=true`, the decorator intercepts file operations to use local git clone instead of API calls. API-only operations (issues, repo management) are delegated to the wrapped provider.
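
The delegation described above can be illustrated with a minimal sketch. Every name here (`ApiProvider`, `GitCliDecorator`, `get_file`, `list_issues`) is hypothetical and chosen only for illustration; the real `BaseSCMProvider` interface and `git_cli.py` may look quite different.

```python
class ApiProvider:
    """Hypothetical stand-in for an API-backed provider."""

    def get_file(self, path: str) -> str:
        return f"api:{path}"

    def list_issues(self) -> list[str]:
        return ["issue-1"]


class GitCliDecorator:
    """Serve file reads from a local checkout; delegate everything else."""

    def __init__(self, wrapped, checkout: dict[str, str]):
        self._wrapped = wrapped
        self._checkout = checkout  # a dict stands in for a local git clone

    def get_file(self, path: str) -> str:
        # File operation: answered locally, no API call.
        return self._checkout[path]

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so API-only operations
        # (issues, repo management) fall through to the wrapped provider.
        return getattr(self._wrapped, name)
```

With this shape, callers hold one provider object and do not need to know whether file reads come from the API or from a clone.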

### Batch Management
### Per-Source Storage

Each manifest maps to one `source`. All of a source's documents live under
`<DOWNLOAD_DIR>/<sanitized-source>/`, with one SQLite sync-state file per
source under `STATE_DIR`. Content hashes recorded in sync state enable
incremental ingestion (only new/changed files are written).
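
As a rough sketch of the layout described above, the following maps a source name to its download folder and sync-state file. The sanitization rule and the `.db` suffix are assumptions made for illustration; the actual rules live in `local_store.py` and `local_state.py` and may differ.

```python
import re
from pathlib import Path


def sanitize_source(source: str) -> str:
    """Hypothetical rule: replace anything outside [A-Za-z0-9._-] with "_"."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", source).strip("._")
    return cleaned or "unnamed"


def source_paths(download_dir: str, state_dir: str, source: str) -> tuple[Path, Path]:
    """Return (document folder, sync-state file) for one source."""
    name = sanitize_source(source)
    # The ".db" suffix is an assumption; the text only says "one SQLite file".
    return Path(download_dir) / name, Path(state_dir) / f"{name}.db"
```

For example, `source_paths("downloads", "sync_state", "my org/docs")` would place documents under `downloads/my_org_docs` under this assumed rule.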

### haiku-rag Load Serialization

Files are grouped into batches by source:
- System reuses existing batch if source matches
- Creates new batch only if none exists
- Enables incremental ingestion (only new/changed files)
After each manifest run (scheduler, startup, or CLI), a `haiku-ingester`
load is queued for the source. Inside the server a single worker drains a
global FIFO queue (`server/haiku_queue.py`), so only one load runs at a
time; the CLI runs loads sequentially for the same effect. The subprocess
inherits the parent environment plus injected `SOURCE` (sanitized
download-folder name) and `DOWNLOAD_DIR`. See `manifest/haiku_loader.py`.
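
The single-worker queue can be sketched as below. This is not the code in `server/haiku_queue.py`; `run_serialized`, `build_load_env`, and the `run_load` callback are hypothetical names, and only the `SOURCE` and `DOWNLOAD_DIR` variables come from the description above.

```python
import asyncio
import contextlib
import os


def build_load_env(source: str, download_dir: str) -> dict[str, str]:
    """Parent environment plus the two injected variables."""
    return {**os.environ, "SOURCE": source, "DOWNLOAD_DIR": download_dir}


async def run_serialized(sources, run_load):
    """Drain a FIFO queue with one worker so at most one load is active."""
    queue: asyncio.Queue = asyncio.Queue()
    for source in sources:
        queue.put_nowait(source)
    results = []

    async def worker():
        while True:
            source = await queue.get()
            try:
                await run_load(source)
                results.append((source, None))
            except Exception as exc:  # keep draining after a failed load
                results.append((source, exc))
            finally:
                queue.task_done()

    task = asyncio.create_task(worker())
    await queue.join()  # returns once every queued load has been processed
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return results
```

Because there is exactly one worker task, ordering and mutual exclusion both follow from the queue itself, without an extra lock.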

### Incremental Sync (SCM)

Commit-based tracking for efficient syncing:
1. Get last processed commit SHA from Ingester
1. Get last processed commit SHA from local sync state
2. Fetch commits since that SHA
3. Extract changed file paths
4. Download only modified files
5. Store new commit SHA
5. Store new commit SHA in local sync state
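
The five steps above can be sketched with an in-memory stand-in for local sync state. `Commit`, `incremental_sync`, and the `download` callback are hypothetical names, and the sketch assumes commits arrive oldest-first; it does not handle deleted files.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    sha: str
    changed_paths: list[str]


def incremental_sync(state: dict, repo: str, commits: list[Commit], download) -> list[str]:
    """Download files changed since the last recorded SHA; return their paths."""
    last_sha = state.get(repo)                       # 1. last processed SHA
    shas = [c.sha for c in commits]
    start = shas.index(last_sha) + 1 if last_sha in shas else 0
    new_commits = commits[start:]                    # 2. commits since that SHA
    if not new_commits:
        return []
    changed = sorted({p for c in new_commits for p in c.changed_paths})  # 3.
    for path in changed:
        download(path)                               # 4. only modified files
    state[repo] = new_commits[-1].sha                # 5. store the new SHA
    return changed
```

Storing the SHA only after the downloads finish means an interrupted run is retried from the same starting point on the next invocation.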

## File Organization

@@ -259,7 +286,8 @@ When adding features:

- Do not mix hashing algorithms (SHA256 vs SHA3-256)
- Always use async/await for I/O operations
- Batch names must be unique per source
- Manifest IDs must be unique when running a directory of manifests
- Only one haiku-rag load runs at a time (capacity constraint)
- WebDAV requires SSL verification by default
- SCM providers must implement `BaseSCMProvider` interface

40 changes: 29 additions & 11 deletions CLAUDE.md
@@ -1,6 +1,6 @@
# Soliplex Ingester-Agents

Document ingestion CLI for loading files from filesystem, WebDAV, SCM platforms, and web pages into Soliplex Ingester. Supports declarative YAML manifests for multi-source ingestion.
Document ingestion CLI for collecting files from filesystem, WebDAV, SCM platforms, and web pages into a local download directory. Supports declarative YAML manifests for multi-source ingestion and an optional haiku-rag load step that indexes each source into LanceDB.

## Quick Reference

@@ -42,6 +42,7 @@ si-agent webdav validate-config <path>
# Manifest runner
si-agent manifest run <path> # Run manifest file or directory
si-agent manifest run <path> --json # Output results as JSON
si-agent manifest run <path> --load # Also run a haiku-rag load per manifest

# REST API server
si-agent serve
@@ -64,8 +65,9 @@ si-agent serve --reload
```text
src/soliplex/agents/
├── cli.py # Main Typer entry point
├── client.py # Ingester API client (batch, status, ingest, sync state)
├── config.py # Pydantic Settings, manifest/component models
├── local_state.py # Local sync state (hashes, commit SHAs, pruning)
├── local_store.py # Writes documents + .meta.json to DOWNLOAD_DIR
├── common/config.py # Shared validation utilities
├── fs/ # Filesystem agent
│ ├── cli.py # CLI commands
@@ -77,6 +79,7 @@ src/soliplex/agents/
│ └── app.py # Business logic
├── manifest/ # Manifest runner
│ ├── runner.py # YAML loading, validation, agent dispatch
│ ├── haiku_loader.py # haiku-rag batch load subprocess
│ └── cli.py # CLI commands
├── scm/ # SCM agent
│ ├── cli.py # CLI commands
@@ -88,8 +91,10 @@ src/soliplex/agents/
│ ├── utils.py # SHA3-256 hashing, base64 decoding
│ └── templates/ # Jinja2 issue rendering
└── server/ # FastAPI REST API
├── __init__.py # App setup, CORS, scheduler
├── __init__.py # App setup, CORS, scheduler, lifespan
├── auth.py # API key and OAuth2 proxy auth
├── locks.py # Per-manifest execution locks
├── haiku_queue.py # Global FIFO queue serializing haiku-rag loads
└── routes/ # Endpoint handlers (fs, scm, webdav, web, manifest)
```

@@ -99,10 +104,13 @@ Key environment variables:

```bash
# Required
ENDPOINT_URL=http://localhost:8000/api/v1
DOWNLOAD_DIR=downloads # Where fetched documents are written
STATE_DIR=sync_state # Local sync state, one SQLite file per source

# Ingester authentication
INGESTER_API_KEY=your-key
# haiku-rag loading (optional)
HAIKU_LOAD_ENABLED=true # Queue a haiku-rag load after each manifest run
LANCEDB_DIR=/var/lib/lancedb # Base dir for per-source <source>.lancedb
HAIKU_PATH=/etc/haiku # Base dir for haiku-rag config files

# SCM authentication
scm_auth_token=your-token
@@ -134,20 +142,30 @@ MANIFEST_DIR=/path/to/manifests # Directory with manifest .yml files

## Key Patterns

### Batch Management
### Per-Source Storage

Documents are grouped into batches by source name. The system reuses existing batches for incremental ingestion.
Each manifest maps to one `source`. Its documents live under
`<DOWNLOAD_DIR>/<sanitized-source>/`, with one SQLite sync-state file per
source under `STATE_DIR`. Recorded content hashes drive incremental
ingestion.

### haiku-rag Load Serialization

After each manifest run, a `haiku-ingester` load is queued for the source.
Inside the server one worker drains a global FIFO queue
(`server/haiku_queue.py`) so only one load runs at a time; the CLI runs
loads sequentially. See `manifest/haiku_loader.py`.

### Status Checking

Files are hashed and compared against the Ingester database:
Files are hashed and compared against the local sync state:
- **new:** File does not exist
- **mismatch:** File changed (hash differs)
- **match:** File unchanged (skipped)
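
A minimal sketch of the three-way classification above, assuming sync state can be viewed as a path-to-hash mapping. `check_status` and the `recorded` dict are hypothetical, and SHA-256 is used only as a placeholder; the project distinguishes SHA256 from SHA3-256, so the real agent may use a different digest here.

```python
import hashlib


def check_status(path: str, content: bytes, recorded: dict[str, str]) -> str:
    """Return "new", "mismatch", or "match" for one file."""
    digest = hashlib.sha256(content).hexdigest()  # placeholder algorithm
    known = recorded.get(path)
    if known is None:
        return "new"          # no hash recorded for this path
    if known != digest:
        return "mismatch"     # recorded hash differs from current content
    return "match"            # unchanged, can be skipped
```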

### Incremental Sync (SCM)

The run-incremental command tracks the last processed commit SHA to only fetch changed files on subsequent runs.
The run-incremental command tracks the last processed commit SHA in local sync state to only fetch changed files on subsequent runs.

## Testing

@@ -156,7 +174,7 @@ The run-incremental command tracks the last processed commit SHA to only fetch c
uv run pytest tests/unit/

# Specific test file
uv run pytest tests/unit/test_client.py
uv run pytest tests/unit/test_manifest_runner.py

# Coverage report
uv run pytest --cov-report=html