Ceres is a Rust toolkit for harvesting metadata from open data portals and keeping that catalog synchronized over time.
Harvesting is the center of the project. Embeddings, semantic search, exports, and the REST API build on top of the harvested catalog when you want them.
Named after the Roman goddess of harvest and agriculture.
- Harvests dataset metadata from portal APIs into PostgreSQL with incremental sync, delta detection, and stale dataset tracking
- Works in metadata-only mode, so you can build and maintain a local catalog without any embedding provider configured
- Adds embeddings later through a separate pipeline, with local Ollama as the recommended zero-cost path and OpenAI or Gemini still available
- Exposes optional search, export, and API layers once your catalog is populated
Most open data tooling focuses on search first. In practice, the hard part is getting a reliable, repeatable harvesting pipeline:
- Portals expose different APIs and quality levels
- Large catalogs need incremental syncs and bounded memory usage
- Removed datasets need to be detected without deleting history
- Embedding should not be coupled to harvesting, because it is optional, slower, and operationally distinct
Ceres addresses that by splitting the system into two stages:
- Harvest and normalize metadata
- Optionally embed and search that catalog
- Harvesting: CKAN, DCAT-AP udata REST, and SPARQL-backed DCAT endpoints (e.g. data.europa.eu)
- Embeddings: Ollama locally, or Gemini/OpenAI if you prefer hosted providers
- Search: semantic search over datasets that already have embeddings, backed by a tuned HNSW index for catalogs at scale
- Export: JSONL, JSON, CSV, and curated Parquet
- Operations: CLI, REST API, database-backed harvest jobs, graceful shutdown, protected admin endpoints
- Harvest-first workflow with optional embedding and search
- Streaming harvest pipeline for large portals
- Incremental sync plus content-hash delta detection
- Metadata-only mode with no embedding dependency
- Standalone embed command for backfills and provider switches
- Local Ollama embedding support with native batching
- Recoverable job queue for API-triggered harvests
- Soft stale detection for datasets removed upstream
- Batch harvesting through portals.toml
- Export pipeline for downstream analytics and HuggingFace publishing
Today, the shipped portal clients cover:
- ckan
- dcat for udata-flavored DCAT-AP portals such as data.public.lu and data.gouv.fr
- dcat with --profile sparql for SPARQL-backed DCAT catalogs such as data.europa.eu
The codebase already models additional portal types such as socrata, but they are not yet implemented in the current client factory.
- Rust 1.88+
- Docker and Docker Compose
- PostgreSQL 16+ with pgvector when running outside Docker
- Optional for embeddings: Ollama locally, or Gemini/OpenAI credentials
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
docker compose up db -d
cp .env.example .env
make migrate

# Single CKAN portal
cargo run --bin ceres -- harvest https://dati.comune.milano.it --metadata-only
# Single DCAT portal
cargo run --bin ceres -- harvest https://data.public.lu --type dcat --metadata-only
# All enabled portals from config
cargo run --bin ceres -- harvest --config examples/portals.toml --metadata-only

With Ollama locally:
ollama serve
ollama pull nomic-embed-text
export EMBEDDING_PROVIDER=ollama
cargo run --bin ceres -- embed

cargo run --bin ceres -- search "public transport" --limit 5
cargo run --bin ceres -- export --format jsonl > datasets.jsonl
cargo run --bin ceres -- stats

If you only want harvesting, run with --metadata-only and skip embedding configuration entirely.
If you want embeddings later, set:
EMBEDDING_PROVIDER=ollama
OLLAMA_ENDPOINT=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text

Hosted providers are still supported:
EMBEDDING_PROVIDER=gemini
GEMINI_API_KEY=...
# or
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=...
EMBEDDING_MODEL=text-embedding-3-small

Batch harvesting uses portals.toml:
[[portals]]
name = "milano"
url = "https://dati.comune.milano.it"
type = "ckan"
language = "it"
[[portals]]
name = "luxembourg"
url = "https://data.public.lu"
type = "dcat"
language = "fr"
enabled = false

See examples/portals.toml for a larger configuration set.
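The batch configuration above can be read as a list of portal entries that a run filters before harvesting. The following is a minimal sketch of that selection step, not Ceres' actual config code: `PortalEntry`, `is_enabled`, and `select_portals` are hypothetical names, and it assumes that an entry without an `enabled` key counts as enabled, which is what the milano entry suggests.

```rust
/// One [[portals]] entry. Field names follow the TOML keys shown above,
/// except `portal_type`, since `type` is a reserved word in Rust.
/// This struct is illustrative and not taken from the Ceres codebase.
#[derive(Debug, Clone, PartialEq)]
pub struct PortalEntry {
    pub name: String,
    pub url: String,
    pub portal_type: String,
    pub language: String,
    /// None when the key is omitted from the file.
    pub enabled: Option<bool>,
}

impl PortalEntry {
    /// Assumption: a portal without an explicit `enabled` key is enabled.
    pub fn is_enabled(&self) -> bool {
        self.enabled.unwrap_or(true)
    }
}

/// Pick the portals for one batch run: the portal called `name` if one
/// is given, otherwise every enabled portal.
pub fn select_portals<'a>(
    portals: &'a [PortalEntry],
    name: Option<&str>,
) -> Vec<&'a PortalEntry> {
    portals
        .iter()
        .filter(|p| match name {
            Some(wanted) => p.name == wanted,
            None => p.is_enabled(),
        })
        .collect()
}
```

With the two entries shown above, `select_portals(&portals, None)` would return only milano, while passing `Some("luxembourg")` would return the disabled entry by name. Whether Ceres lets a named portal override `enabled = false` is not stated in this document, so treat that part as a guess.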
# All enabled portals from config
ceres harvest
# Ad-hoc CKAN harvest
ceres harvest https://dati.comune.milano.it
# Ad-hoc DCAT harvest
ceres harvest https://data.public.lu --type dcat
# Ad-hoc SPARQL-backed DCAT harvest
ceres harvest https://data.europa.eu --type dcat --profile sparql
# Named portal from config
ceres harvest --portal milano --config examples/portals.toml
# Force full sync
ceres harvest --portal milano --full-sync
# Dry run
ceres harvest --portal milano --dry-run --metadata-only

# Embed everything pending
ceres embed
# Only one portal
ceres embed --portal https://dati.comune.milano.it

ceres search "air quality monitoring" --limit 10

ceres export --format jsonl > datasets.jsonl
ceres export --format csv > datasets.csv
ceres export --format parquet --output ./ceres-export

ceres stats

The harvesting pipeline is built around a few operational principles:
- Incremental sync when the source supports it
- Full-sync fallback when incremental fetch is not available or fails
- Content-hash delta detection so unchanged datasets are not re-embedded
- Streaming page-by-page processing to keep memory bounded
- Stale dataset marking instead of hard deletion
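Two of these principles, content-hash delta detection and stale marking, can be illustrated with a small in-memory model. This is a sketch under stated assumptions rather than the Ceres implementation: `StoredDataset`, `SyncOutcome`, `upsert`, and `mark_stale` are hypothetical names, the catalog is a `HashMap` instead of PostgreSQL, and the hash covers only a title and description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical stored row; field names are illustrative, not Ceres' schema.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredDataset {
    pub content_hash: u64,
    pub needs_embedding: bool,
    pub is_stale: bool,
}

/// Outcome of applying one harvested record to the catalog.
#[derive(Debug, PartialEq, Eq)]
pub enum SyncOutcome {
    Created,
    Updated,
    Unchanged,
}

/// Hash the fields whose changes should trigger re-embedding.
/// DefaultHasher keeps the sketch dependency-free; a persistent catalog
/// would want a hash that is stable across builds, such as SHA-256.
pub fn content_hash(title: &str, description: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    title.hash(&mut hasher);
    description.hash(&mut hasher);
    hasher.finish()
}

/// Insert or update one record, flagging it for embedding only on change.
pub fn upsert(
    catalog: &mut HashMap<String, StoredDataset>,
    id: &str,
    title: &str,
    description: &str,
) -> SyncOutcome {
    let hash = content_hash(title, description);
    if let Some(row) = catalog.get_mut(id) {
        row.is_stale = false; // the portal still lists this dataset
        if row.content_hash == hash {
            return SyncOutcome::Unchanged; // existing embedding stays valid
        }
        row.content_hash = hash;
        row.needs_embedding = true;
        return SyncOutcome::Updated;
    }
    catalog.insert(
        id.to_string(),
        StoredDataset { content_hash: hash, needs_embedding: true, is_stale: false },
    );
    SyncOutcome::Created
}

/// After a full sync, mark rows the portal no longer lists as stale
/// instead of deleting them. Returns how many rows were newly marked.
pub fn mark_stale(
    catalog: &mut HashMap<String, StoredDataset>,
    seen: &HashSet<String>,
) -> usize {
    let mut marked = 0;
    for (id, row) in catalog.iter_mut() {
        if !seen.contains(id) && !row.is_stale {
            row.is_stale = true;
            marked += 1;
        }
    }
    marked
}
```

In this model, re-harvesting an unchanged record leaves `needs_embedding` untouched, so a later embedding run skips it, and a dataset that disappears upstream keeps its row and history with `is_stale` set.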
Embedding is fully decoupled from harvesting:
- HarvestService stores and updates metadata
- EmbeddingService processes datasets with missing embeddings
- HarvestPipeline composes both when you want the combined flow
That separation is what makes local-first embedding practical and keeps harvest jobs usable even when no embedder is configured.
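The separation described here can be sketched with two independent functions over a shared catalog. The names below (`Dataset`, `Embedder`, `LengthEmbedder`, `harvest`, `embed_pending`) are hypothetical stand-ins and do not reflect the real signatures of HarvestService or EmbeddingService. The sketch also assumes that a changed record drops its old embedding so it is picked up again.

```rust
/// Hypothetical catalog row; not the actual Ceres schema.
#[derive(Debug, Clone)]
pub struct Dataset {
    pub id: String,
    pub text: String,
    pub embedding: Option<Vec<f32>>,
}

/// Anything that can turn text into a vector (a local or hosted provider).
pub trait Embedder {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Toy embedder used only to exercise the sketch: one dimension,
/// equal to the text length in bytes.
pub struct LengthEmbedder;

impl Embedder for LengthEmbedder {
    fn embed(&self, text: &str) -> Vec<f32> {
        vec![text.len() as f32]
    }
}

/// Harvest stage: stores and updates metadata, never touches an embedder.
pub fn harvest(catalog: &mut Vec<Dataset>, records: &[(&str, &str)]) {
    for &(id, text) in records {
        if let Some(pos) = catalog.iter().position(|d| d.id == id) {
            if catalog[pos].text != text {
                catalog[pos].text = text.to_string();
                catalog[pos].embedding = None; // changed content needs a new vector
            }
        } else {
            catalog.push(Dataset {
                id: id.to_string(),
                text: text.to_string(),
                embedding: None,
            });
        }
    }
}

/// Embedding stage: processes only datasets with missing embeddings.
/// Returns how many datasets were embedded in this run.
pub fn embed_pending(catalog: &mut [Dataset], embedder: &dyn Embedder) -> usize {
    let mut processed = 0;
    for dataset in catalog.iter_mut().filter(|d| d.embedding.is_none()) {
        dataset.embedding = Some(embedder.embed(&dataset.text));
        processed += 1;
    }
    processed
}
```

Because `harvest` takes no embedder at all, it works in metadata-only mode by construction, and `embed_pending` can be run later, repeatedly, or with a different provider as a backfill.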
Start the server:
cargo run --bin ceres-server

Available endpoints:
- GET /api/v1/health
- GET /api/v1/stats
- GET /api/v1/search
- GET /api/v1/portals
- GET /api/v1/portals/{name}/stats
- GET /api/v1/harvest/status
- GET /api/v1/datasets/{id}
- POST /api/v1/portals/{name}/harvest
- POST /api/v1/harvest
- GET /api/v1/export
- GET /swagger-ui
Set CERES_ADMIN_TOKEN to enable protected write endpoints.
The website lives in website/ and documents the same harvest-first model:
- harvesting architecture
- optional embeddings and costs
- contributing and security notes
- Ceres-Claude-Skill for Claude Code and Claude custom skill support
- AndreaBozzo/ceres-open-data-index for published dataset snapshots
Current version: 0.4.0
See website/src/content/docs/CONTRIBUTING.md and the crate-level docs for development setup, tests, and release workflow.
Apache-2.0. See LICENSE.
