This guide will help you get Soliplex Ingester up and running in minutes.
Prerequisites:
- Python 3.12 or higher
- pip or uv package manager
- SQLite (included with Python) or PostgreSQL
- Docling server for document parsing (optional)
- S3 backend (optional)
For production deployment using Docker Compose, see the comprehensive Docker Deployment Guide.
The docker-compose configuration provides all necessary services:
- PostgreSQL database with initialization scripts
- Docling document parsing services with GPU support and load balancing
- SeaweedFS for S3-compatible object storage
- HAProxy load balancer for high availability
Quick Start with Docker:
cd docker
docker-compose up -d
Access the application at http://localhost:8002
DOCKER.md provides detailed instructions, including:
- Service configuration and scaling
- GPU setup and optimization
- Authentication with OAuth2 Proxy
- Production deployment best practices
- Troubleshooting guide
Using pip:
cd soliplex-ingester
pip install -e .
Using uv:
cd soliplex-ingester
uv pip install -e .
You can also integrate Soliplex Ingester into another Python project by installing it like any other package. This lets you substitute custom methods for any part of the workflow.
uv init --lib <my project name>
uv add https://github.com/soliplex/ingester.git
uv run si-cli bootstrap
This installs the package and makes the si-cli and si-diag commands available.
si-cli --help
si-diag --help
You should see the CLI help menus.
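If the help menus do not appear, checking whether the two commands are on PATH can narrow the problem down. The sketch below uses only the Python standard library; the helper name `missing_commands` is hypothetical and not part of Soliplex Ingester.

```python
import shutil


def missing_commands(names=("si-cli", "si-diag")) -> list[str]:
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_commands()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("si-cli and si-diag are both on PATH")
```

A missing command usually means the package was installed into a different virtual environment than the one currently active.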
Automatically configure:
uv run init-env
Or manually create a .env file in the project root:
# Minimum required configuration
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
# Optional: Docling service (for document parsing)
DOCLING_SERVER_URL=http://localhost:5001/v1
# Optional: Adjust logging
LOG_LEVEL=INFO
Load the environment (comment lines are skipped):
export $(grep -v '^#' .env | xargs)
Or on Windows:
Get-Content .env | Where-Object { $_ -match '^[^#]+=' } | ForEach-Object { $var = $_.Split('=', 2); [Environment]::SetEnvironmentVariable($var[0], $var[1]) }
Alternatively, run si-cli through uv and let uv load the environment file:
uv run --env-file=.env si-cli
Validate the configuration:
si-cli validate-settings
This should display your configuration without errors.
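The same KEY=VALUE loading can be sketched in Python, which is handy when launching si-cli from a script. This is a minimal sketch: `parse_env` and `load_env` are hypothetical helper names, and the parser deliberately ignores quoting and variable expansion.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_env(path: str = ".env") -> dict[str, str]:
    """Read a .env file and copy its entries into os.environ."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    os.environ.update(values)
    return values


sample = """\
# Minimum required configuration
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
LOG_LEVEL=INFO
"""
print(parse_env(sample))
```

Calling `load_env()` before starting a subprocess makes the variables visible to that subprocess as well, since child processes inherit `os.environ`.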
si-cli db-init
This command:
- Creates the SQLite database file at db/documents.db
- Creates all necessary tables
- Runs migrations
si-cli serve --reload
The server starts on http://127.0.0.1:8000 with:
- Auto-reload on code changes
- Integrated worker for processing
- Web UI at /
- OpenAPI docs at /docs
Web UI (Main Application):
Open your browser and navigate to:
http://localhost:8000/
The web UI provides:
- Dashboard - Monitor workflow status and batch processing
- Batches - View and manage document batches
- Workflows - Inspect workflow definitions and runs
- Parameters - View and create parameter sets
- LanceDB - Manage vector databases
- Statistics - View processing metrics and performance data
API Documentation (Swagger UI):
For API testing and documentation:
http://localhost:8000/docs
Alternative API Documentation (ReDoc):
http://localhost:8000/redoc
Test the server:
curl http://localhost:8000/docs
This should return the HTML of the Swagger UI page.
Create a batch:
curl -X POST "http://localhost:8000/api/v1/batch/" \
  -d "source=test" \
  -d "name=My First Batch"
Response:
{
  "batch_id": 1
}
Option A: Upload a file
curl -X POST "http://localhost:8000/api/v1/document/ingest-document" \
  -F "file=@sample.pdf" \
  -F "source_uri=/documents/sample.pdf" \
  -F "source=test" \
  -F "batch_id=1"
Option B: Provide a URI (requires Docling server)
curl -X POST "http://localhost:8000/api/v1/document/ingest-document" \
  -F "input_uri=https://example.com/document.pdf" \
  -F "source_uri=/remote/document.pdf" \
  -F "source=test" \
  -F "batch_id=1"
Response:
{
  "batch_id": 1,
  "document_uri": "/documents/sample.pdf",
  "document_hash": "sha256-abc123...",
  "source": "test",
  "uri_id": 1
}
Start workflows for the batch:
curl -X POST "http://localhost:8000/api/v1/batch/start-workflows" \
  -d "batch_id=1" \
  -d "workflow_definition_id=batch"
Response:
{
  "message": "Workflows started",
  "workflows": 1,
  "run_group": 1
}
Check batch status:
curl "http://localhost:8000/api/v1/batch/status?batch_id=1"
Response:
{
  "batch": { ... },
  "document_count": 1,
  "workflow_count": {
    "COMPLETED": 0,
    "RUNNING": 1,
    "PENDING": 0
  },
  "workflows": [ ... ],
  "parsed": 0,
  "remaining": 1
}
Watch workflow runs:
watch -n 5 'curl -s "http://localhost:8000/api/v1/workflow/?batch_id=1"'
Once processing completes, check the document:
curl "http://localhost:8000/api/v1/document/?batch_id=1"
List available workflows:
si-cli list-workflows
Inspect a workflow:
si-cli dump-workflow batch
View workflow runs:
curl "http://localhost:8000/api/v1/workflow/?batch_id=1"
List parameter sets:
si-cli list-param-sets
View parameters:
si-cli dump-param-set default
Create custom parameters:
- Copy config/params/default.yaml to config/params/custom.yaml
- Modify settings as needed
- Use in API: -d "param_id=custom"
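The last step above can be sketched as a small form-body builder. This guide does not say which endpoint accepts `param_id`; the sketch assumes it is sent alongside `batch_id` and `workflow_definition_id` to the start-workflows endpoint shown earlier. The function name `start_workflows_form` is hypothetical.

```python
from urllib.parse import urlencode


def start_workflows_form(batch_id, workflow_definition_id="batch", param_id=None):
    """Build the form-encoded body that curl's -d flags would send."""
    fields = {
        "batch_id": batch_id,
        "workflow_definition_id": workflow_definition_id,
    }
    # Assumption: param_id is optional and omitted when the default set is wanted.
    if param_id is not None:
        fields["param_id"] = param_id
    return urlencode(fields)


print(start_workflows_form(1, param_id="custom"))
```

The resulting string can be passed as the request body with a `Content-Type: application/x-www-form-urlencoded` header, which is what `curl -d` sends by default.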
Run additional workers:
# Terminal 1
si-cli worker
# Terminal 2
si-cli worker
# Terminal 3
si-cli worker
Each worker processes steps independently, increasing throughput.
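Opening one terminal per worker gets tedious beyond a handful. The following is a minimal sketch, assuming si-cli is on PATH and that each `si-cli worker` process runs until it is terminated; `worker_command`, `start_workers`, and `stop_workers` are hypothetical helper names, not part of the CLI.

```python
import subprocess


def worker_command() -> list[str]:
    """Command line for a single worker, as used in the terminals above."""
    return ["si-cli", "worker"]


def start_workers(count: int) -> list[subprocess.Popen]:
    """Launch `count` independent worker processes."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [subprocess.Popen(worker_command()) for _ in range(count)]


def stop_workers(processes: list[subprocess.Popen]) -> None:
    """Ask every worker to terminate, then wait for each one to exit."""
    for process in processes:
        process.terminate()
    for process in processes:
        process.wait()


# Example usage (not executed here, since it starts real processes):
#   workers = start_workers(3)
#   ...
#   stop_workers(workers)
```

For long-running deployments a process supervisor is usually a better fit than a script like this, since it can restart workers that crash.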
Using si-diag (recommended):
si-diag batch list # List all batches
si-diag status running # Currently running steps
si-diag status recent hour # Recent activity
si-diag run-group list --batch-id 1 # Run groups for a batch
si-diag workflow list 1 # Workflow runs in a run group
si-diag document find "sample.pdf" # Search documents by URI
API Documentation: Browse to http://localhost:8000/docs for interactive API docs.
Database Inspection:
sqlite3 db/documents.db
sqlite> .tables
sqlite> SELECT * FROM documentbatch;
sqlite> SELECT * FROM workflowrun WHERE batch_id = 1;
Problem: Configuration validation fails
Solution:
si-cli validate-settings
Fix any reported errors in your .env file.
Problem: Port already in use
Solution:
si-cli serve --port 8001
Problem: Workflows remain in PENDING status
Solution: Ensure a worker is running:
si-cli worker
Check worker logs for errors.
Problem: Parse step fails with connection error
Solution:
- Verify that the Docling server is running
- Check that DOCLING_SERVER_URL is correct
- Test connectivity:
  curl http://localhost:5001/v1/health
Problem: Database connection fails
Solution:
- Check DOC_DB_URL format
- Ensure directory exists: mkdir -p db
- Check permissions: chmod 755 db
- Reinitialize: si-cli db-init
For active development:
1. Enable auto-reload:
si-cli serve --reload
2. Set debug logging:
export LOG_LEVEL=DEBUG
si-cli serve --reload
3. Watch logs:
si-cli serve --reload 2>&1 | tee server.log
4. Monitor database:
watch -n 2 'sqlite3 db/documents.db "SELECT status, COUNT(*) FROM workflowrun GROUP BY status"'
Create production .env:
# Database
DOC_DB_URL=postgresql+asyncpg://user:password@db-host:5432/soliplex
# Services
DOCLING_SERVER_URL=http://docling-prod:5001/v1
# Logging
LOG_LEVEL=WARNING
# Performance
INGEST_WORKER_CONCURRENCY=20
DOCLING_CONCURRENCY=5
WORKER_TASK_COUNT=10
# Storage
FILE_STORE_DIR=/var/lib/soliplex/files
LANCEDB_DIR=/var/lib/soliplex/lancedb
Server:
si-cli serve --host 0.0.0.0 --port 8000 --workers 4
Workers (in separate processes):
si-cli worker # Worker 1
si-cli worker # Worker 2
si-cli worker # Worker 3
Behind Nginx:
upstream soliplex {
    server 127.0.0.1:8000;
}
server {
    listen 80;
    server_name soliplex.example.com;
    location / {
        proxy_pass http://soliplex;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
Dockerfile:
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e .
CMD ["si-cli", "serve", "--host", "0.0.0.0"]
docker-compose.yml:
version: '3.8'
services:
  server:
    build: .
    ports:
      - "8000:8000"
    environment:
      DOC_DB_URL: postgresql+asyncpg://postgres:password@db/soliplex
      DOCLING_SERVER_URL: http://docling:5001/v1
    depends_on:
      - db
  worker:
    build: .
    command: si-cli worker
    environment:
      DOC_DB_URL: postgresql+asyncpg://postgres:password@db/soliplex
      DOCLING_SERVER_URL: http://docling:5001/v1
    depends_on:
      - db
    deploy:
      replicas: 3
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: soliplex
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data:
Run:
docker-compose up -d
Related documentation:
- Architecture Overview - System design and components
- API Reference - Complete REST API documentation
- Workflow System - Workflow concepts and configuration
- Database Schema - Data models and relationships
- Configuration - Environment variables and settings
- CLI Reference - Command-line interface guide (si-cli and si-diag)
Check the examples/ directory (if available) for:
- Sample workflows
- Integration scripts
- Custom step handlers
- Batch processing examples
- Issues: Report bugs and request features
- Discussions: Ask questions and share ideas
- Contributing: See CONTRIBUTING.md (if available)
import asyncio
from pathlib import Path

import httpx


async def ingest_directory(directory: Path, batch_id: int, source: str):
    """Ingest all documents in a directory."""
    async with httpx.AsyncClient() as client:
        for file_path in directory.glob("**/*.pdf"):
            with open(file_path, "rb") as f:
                files = {"file": f}
                data = {
                    "source_uri": str(file_path),
                    "source": source,
                    "batch_id": batch_id,
                }
                response = await client.post(
                    "http://localhost:8000/api/v1/document/ingest-document",
                    files=files,
                    data=data,
                )
            print(f"Ingested {file_path}: {response.status_code}")


# Usage
asyncio.run(ingest_directory(Path("/documents"), batch_id=1, source="filesystem"))

import asyncio
import httpx


async def wait_for_batch(batch_id: int, poll_interval: int = 5):
    """Wait for batch processing to complete."""
    async with httpx.AsyncClient() as client:
        while True:
            response = await client.get(
                "http://localhost:8000/api/v1/batch/status",
                params={"batch_id": batch_id},
            )
            # Fail fast on HTTP errors instead of parsing an error body
            response.raise_for_status()
            data = response.json()
            counts = data["workflow_count"]
            print(f"Completed: {counts.get('COMPLETED', 0)}, "
                  f"Running: {counts.get('RUNNING', 0)}, "
                  f"Failed: {counts.get('FAILED', 0)}")
            # Done once nothing is running or waiting to run
            if counts.get("RUNNING", 0) == 0 and counts.get("PENDING", 0) == 0:
                print("Batch complete!")
                break
            await asyncio.sleep(poll_interval)


# Usage
asyncio.run(wait_for_batch(1))

#!/bin/bash
# retry_failed.sh
BATCH_ID="$1"
# "// empty" makes jq print nothing (rather than "null") when no run group exists
RUN_GROUP=$(curl -s "http://localhost:8000/api/v1/workflow/run-groups?batch_id=${BATCH_ID}" | jq -r '.[0].id // empty')
if [ -n "$RUN_GROUP" ]; then
  curl -X POST "http://localhost:8000/api/v1/workflow/retry" \
    -d "run_group_id=${RUN_GROUP}"
  echo "Retried run group ${RUN_GROUP}"
else
  echo "No run group found for batch ${BATCH_ID}"
fi
Now that you have Soliplex Ingester running, you can:
- Customize workflows - Create workflows for your specific needs
- Integrate services - Connect your data sources and RAG backends
- Scale processing - Add more workers and optimize configuration
- Monitor production - Set up logging, metrics, and alerting
- Build applications - Use the API to build document processing apps
Welcome to Soliplex Ingester! 🚀