Ceres

Harvest-first toolkit for open data portals

Quick Start • What Ceres Does • Usage • REST API

Ceres is a Rust toolkit for harvesting metadata from open data portals and keeping that catalog synchronized over time.

Harvesting is the center of the project. Embeddings, semantic search, exports, and the REST API build on top of the harvested catalog when you want them.

Named after the Roman goddess of harvest and agriculture.

What Ceres Does

Harvests dataset metadata from portal APIs into PostgreSQL with incremental sync, delta detection, and stale dataset tracking
Works in metadata-only mode, so you can build and maintain a local catalog without any embedding provider configured
Adds embeddings later through a separate pipeline, with local Ollama as the recommended zero-cost path and OpenAI or Gemini still available
Exposes optional search, export, and API layers once your catalog is populated

Why This Shape

Most open data tooling focuses on search first. In practice, the hard part is building a reliable, repeatable harvesting pipeline:

Portals expose different APIs and quality levels
Large catalogs need incremental syncs and bounded memory usage
Removed datasets need to be detected without deleting history
Embedding should not be coupled to harvesting, because it is optional, slower, and operationally distinct

Ceres addresses that by splitting the system into two stages:

Harvest and normalize metadata
Optionally embed and search that catalog

Current Scope

Harvesting: CKAN, DCAT-AP udata REST, and SPARQL-backed DCAT endpoints (e.g. data.europa.eu)
Embeddings: Ollama locally, or Gemini/OpenAI if you prefer hosted providers
Search: semantic search over datasets that already have embeddings, backed by a tuned HNSW index for catalogs at scale
Export: JSONL, JSON, CSV, and curated Parquet
Operations: CLI, REST API, database-backed harvest jobs, graceful shutdown, protected admin endpoints

Key Capabilities

Harvest-first workflow with optional embedding and search
Streaming harvest pipeline for large portals
Incremental sync plus content-hash delta detection
Metadata-only mode with no embedding dependency
Standalone embed command for backfills and provider switches
Local Ollama embedding support with native batching
Recoverable job queue for API-triggered harvests
Soft stale detection for datasets removed upstream
Batch harvesting through portals.toml
Export pipeline for downstream analytics and HuggingFace publishing

Supported Portal Types

Today, the shipped portal clients cover:

ckan
dcat for udata-flavored DCAT-AP portals such as data.public.lu and data.gouv.fr
dcat with --profile sparql for SPARQL-backed DCAT catalogs such as data.europa.eu

The codebase already models additional portal types such as socrata, but they are not yet implemented in the current client factory.

Quick Start

Prerequisites

Rust 1.88+
Docker and Docker Compose
PostgreSQL 16+ with pgvector when running outside Docker
Optional for embeddings: Ollama locally, or Gemini/OpenAI credentials

1. Clone and start the database

git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
docker compose up db -d
cp .env.example .env
make migrate

2. Harvest metadata first

# Single CKAN portal
cargo run --bin ceres -- harvest https://dati.comune.milano.it --metadata-only

# Single DCAT portal
cargo run --bin ceres -- harvest https://data.public.lu --type dcat --metadata-only

# All enabled portals from config
cargo run --bin ceres -- harvest --config examples/portals.toml --metadata-only

3. Add embeddings only if you want semantic search

With Ollama locally:

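# Start the Ollama server; it runs in the foreground, so use a separate terminal
ollama serve

# Download the embedding model
ollama pull nomic-embed-text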

export EMBEDDING_PROVIDER=ollama
cargo run --bin ceres -- embed

4. Search or export

cargo run --bin ceres -- search "public transport" --limit 5
cargo run --bin ceres -- export --format jsonl > datasets.jsonl
cargo run --bin ceres -- stats

Configuration

Embeddings are optional

If you only want harvesting, run with --metadata-only and skip embedding configuration entirely.

If you want embeddings later, set:

EMBEDDING_PROVIDER=ollama
OLLAMA_ENDPOINT=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text

Hosted providers are still supported:

EMBEDDING_PROVIDER=gemini
GEMINI_API_KEY=...

# or

EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=...
EMBEDDING_MODEL=text-embedding-3-small
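
As an illustration of how the variables above might be resolved, here is a minimal Rust sketch. It is not the Ceres implementation: the `EmbeddingConfig` struct, the `Provider` enum, and `resolve_embedding_config` are hypothetical names. The fallback to http://localhost:11434 when OLLAMA_ENDPOINT is unset, and the treatment of a missing EMBEDDING_PROVIDER as metadata-only operation, are assumptions made for the example.

```rust
/// Embedding providers named in this README.
#[derive(Debug, Clone, PartialEq)]
enum Provider {
    Ollama,
    Gemini,
    OpenAi,
}

/// Resolved embedding settings (hypothetical struct, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct EmbeddingConfig {
    provider: Provider,
    /// `None` leaves the model choice to the provider integration.
    model: Option<String>,
    /// Only set for Ollama.
    endpoint: Option<String>,
    /// Only set for hosted providers.
    api_key: Option<String>,
}

/// Returns Ok(None) when no provider is set (assumed to mean metadata-only),
/// Ok(Some(..)) for a usable configuration, and Err for an invalid one.
/// `get` abstracts the environment lookup so the function is easy to test.
fn resolve_embedding_config(
    get: impl Fn(&str) -> Option<String>,
) -> Result<Option<EmbeddingConfig>, String> {
    let Some(raw) = get("EMBEDDING_PROVIDER") else {
        return Ok(None);
    };
    let model = get("EMBEDDING_MODEL");
    let config = match raw.trim().to_ascii_lowercase().as_str() {
        "ollama" => EmbeddingConfig {
            provider: Provider::Ollama,
            model,
            // Assumed fallback: the endpoint shown in the example above.
            endpoint: Some(
                get("OLLAMA_ENDPOINT").unwrap_or_else(|| "http://localhost:11434".to_string()),
            ),
            api_key: None,
        },
        "gemini" => EmbeddingConfig {
            provider: Provider::Gemini,
            model,
            endpoint: None,
            api_key: Some(get("GEMINI_API_KEY").ok_or("GEMINI_API_KEY is not set")?),
        },
        "openai" => EmbeddingConfig {
            provider: Provider::OpenAi,
            model,
            endpoint: None,
            api_key: Some(get("OPENAI_API_KEY").ok_or("OPENAI_API_KEY is not set")?),
        },
        other => return Err(format!("unknown EMBEDDING_PROVIDER: {other}")),
    };
    Ok(Some(config))
}

fn main() {
    // Read from the real process environment.
    match resolve_embedding_config(|key| std::env::var(key).ok()) {
        Ok(Some(cfg)) => println!("embeddings enabled: {cfg:?}"),
        Ok(None) => println!("no provider set, running metadata-only"),
        Err(e) => eprintln!("invalid embedding configuration: {e}"),
    }
}
```

Passing the lookup as a closure keeps the sketch independent of the process environment, so the same function can be exercised with fixed values in a test.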

Portal configuration

Batch harvesting uses portals.toml:

[[portals]]
name = "milano"
url = "https://dati.comune.milano.it"
type = "ckan"
language = "it"

[[portals]]
name = "luxembourg"
url = "https://data.public.lu"
type = "dcat"
language = "fr"
enabled = false

See examples/portals.toml for a larger configuration set.
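
To make the role of the `enabled` key concrete, the following Rust sketch models the two entries above and selects the ones a batch harvest would visit. It is illustrative only: `PortalEntry`, `enabled_portals`, and `example_entries` are hypothetical names, the entries are built by hand rather than parsed from TOML, and treating an omitted `enabled` key as `true` is an assumption inferred from the example, not confirmed behaviour.

```rust
/// One `[[portals]]` entry, mirroring the keys shown above.
#[derive(Debug, Clone)]
struct PortalEntry {
    name: String,
    url: String,
    /// The `type` key (`type` is a reserved word in Rust).
    portal_type: String,
    language: Option<String>,
    /// `None` when the key is omitted from the file.
    enabled: Option<bool>,
}

impl PortalEntry {
    /// Assumption: an omitted `enabled` key means the portal is enabled.
    fn is_enabled(&self) -> bool {
        self.enabled.unwrap_or(true)
    }
}

/// Keep only the portals a batch harvest would visit.
fn enabled_portals(entries: &[PortalEntry]) -> Vec<&PortalEntry> {
    entries.iter().filter(|p| p.is_enabled()).collect()
}

/// The two entries from the example above, built by hand.
fn example_entries() -> Vec<PortalEntry> {
    vec![
        PortalEntry {
            name: "milano".to_string(),
            url: "https://dati.comune.milano.it".to_string(),
            portal_type: "ckan".to_string(),
            language: Some("it".to_string()),
            enabled: None,
        },
        PortalEntry {
            name: "luxembourg".to_string(),
            url: "https://data.public.lu".to_string(),
            portal_type: "dcat".to_string(),
            language: Some("fr".to_string()),
            enabled: Some(false),
        },
    ]
}

fn main() {
    let entries = example_entries();
    for portal in enabled_portals(&entries) {
        println!(
            "{} ({}, {:?}) -> {}",
            portal.name, portal.portal_type, portal.language, portal.url
        );
    }
}
```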

Usage

Harvest

# All enabled portals from config
ceres harvest

# Ad-hoc CKAN harvest
ceres harvest https://dati.comune.milano.it

# Ad-hoc DCAT harvest
ceres harvest https://data.public.lu --type dcat

# Ad-hoc SPARQL-backed DCAT harvest
ceres harvest https://data.europa.eu --type dcat --profile sparql

# Named portal from config
ceres harvest --portal milano --config examples/portals.toml

# Force full sync
ceres harvest --portal milano --full-sync

# Dry run
ceres harvest --portal milano --dry-run --metadata-only

Embed

# Embed everything pending
ceres embed

# Only one portal
ceres embed --portal https://dati.comune.milano.it

Search

ceres search "air quality monitoring" --limit 10

Export

ceres export --format jsonl > datasets.jsonl
ceres export --format csv > datasets.csv
ceres export --format parquet --output ./ceres-export

Stats

ceres stats

Harvesting Model

The harvesting pipeline is built around a few operational principles:

Incremental sync when the source supports it
Full-sync fallback when incremental fetch is not available or fails
Content-hash delta detection so unchanged datasets are not re-embedded
Streaming page-by-page processing to keep memory bounded
Stale dataset marking instead of hard deletion
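
Two of the principles above, content-hash delta detection and stale marking, can be sketched in a few lines of Rust. This is a simplified illustration, not the Ceres code: `Catalog`, `StoredDataset`, `SyncOutcome`, and the choice of hashed fields are hypothetical, a `HashMap` stands in for PostgreSQL, and `DefaultHasher` stands in for whatever hash the project actually uses (its output is not guaranteed stable across Rust releases, so a persisted catalog would want a fixed algorithm).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Result of comparing an upstream record with the stored one.
#[derive(Debug, Clone, PartialEq)]
enum SyncOutcome {
    Created,
    Updated,
    Unchanged,
}

#[derive(Debug, Clone)]
struct StoredDataset {
    content_hash: u64,
    /// Set when the content changed and the embedding is out of date.
    needs_embedding: bool,
    /// Set when the dataset disappeared upstream; the row is kept.
    stale: bool,
}

/// Hash the fields that matter for change detection (std-only stand-in).
fn content_hash(title: &str, description: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    title.hash(&mut hasher);
    description.hash(&mut hasher);
    hasher.finish()
}

/// In-memory stand-in for the catalog, keyed by upstream dataset id.
#[derive(Default)]
struct Catalog {
    datasets: HashMap<String, StoredDataset>,
}

impl Catalog {
    /// Insert or update one record; unchanged content is left alone.
    fn upsert(&mut self, id: &str, title: &str, description: &str) -> SyncOutcome {
        let hash = content_hash(title, description);
        match self.datasets.get_mut(id) {
            Some(existing) => {
                existing.stale = false; // seen upstream again
                if existing.content_hash == hash {
                    SyncOutcome::Unchanged
                } else {
                    existing.content_hash = hash;
                    existing.needs_embedding = true;
                    SyncOutcome::Updated
                }
            }
            None => {
                let stored = StoredDataset {
                    content_hash: hash,
                    needs_embedding: true,
                    stale: false,
                };
                self.datasets.insert(id.to_string(), stored);
                SyncOutcome::Created
            }
        }
    }

    /// After a full sync, flag datasets that were not seen upstream.
    /// Nothing is deleted; returns how many rows were newly marked.
    fn mark_stale(&mut self, seen_ids: &[&str]) -> usize {
        let mut marked = 0;
        for (id, dataset) in self.datasets.iter_mut() {
            if !dataset.stale && !seen_ids.contains(&id.as_str()) {
                dataset.stale = true;
                marked += 1;
            }
        }
        marked
    }
}

fn main() {
    let mut catalog = Catalog::default();
    println!("{:?}", catalog.upsert("a", "Air quality", "hourly readings"));
    println!("{:?}", catalog.upsert("a", "Air quality", "hourly readings"));
    println!("{:?}", catalog.upsert("b", "Bus stops", "stop locations"));
    println!("newly stale: {}", catalog.mark_stale(&["a"]));
    println!("rows kept: {}", catalog.datasets.len());
}
```

Only the `Created` and `Updated` outcomes set `needs_embedding`, which is the property that lets a later embedding pass skip datasets whose content did not change.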

Embedding is fully decoupled from harvesting:

HarvestService stores and updates metadata
EmbeddingService processes datasets with missing embeddings
HarvestPipeline composes both when you want the combined flow

That separation is what makes local-first embedding practical and keeps harvest jobs usable even when no embedder is configured.
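
The separation described above can be pictured with a small Rust sketch. The type names HarvestService, EmbeddingService, and HarvestPipeline come from this README, but everything else here is assumed for illustration: the `Embedder` trait, the `Dataset` struct, the method names, and the in-memory storage are hypothetical and do not reflect the real signatures in the Ceres crates.

```rust
/// Hypothetical embedder abstraction (Ollama, Gemini, OpenAI would implement it).
trait Embedder {
    fn embed(&self, text: &str) -> Vec<f32>;
}

#[derive(Debug, Clone)]
struct Dataset {
    id: String,
    title: String,
    embedding: Option<Vec<f32>>,
}

/// Stores metadata only; it has no knowledge of embeddings.
#[derive(Default)]
struct HarvestService {
    datasets: Vec<Dataset>,
}

impl HarvestService {
    /// Store `(id, title)` records without embeddings; returns the count.
    fn harvest(&mut self, records: &[(&str, &str)]) -> usize {
        for (id, title) in records {
            self.datasets.push(Dataset {
                id: id.to_string(),
                title: title.to_string(),
                embedding: None,
            });
        }
        records.len()
    }
}

/// Processes datasets whose embedding is missing.
struct EmbeddingService<E: Embedder> {
    embedder: E,
}

impl<E: Embedder> EmbeddingService<E> {
    /// Embed only the datasets that have no embedding yet.
    fn embed_pending(&self, datasets: &mut [Dataset]) -> usize {
        let mut embedded = 0;
        for dataset in datasets.iter_mut().filter(|d| d.embedding.is_none()) {
            dataset.embedding = Some(self.embedder.embed(&dataset.title));
            embedded += 1;
        }
        embedded
    }
}

/// Composes both stages; the embedding stage is optional.
struct HarvestPipeline<E: Embedder> {
    harvest: HarvestService,
    embedding: Option<EmbeddingService<E>>,
}

impl<E: Embedder> HarvestPipeline<E> {
    /// Returns (harvested, embedded); embedded is 0 in metadata-only mode.
    fn run(&mut self, records: &[(&str, &str)]) -> (usize, usize) {
        let harvested = self.harvest.harvest(records);
        let embedded = match &self.embedding {
            Some(service) => service.embed_pending(&mut self.harvest.datasets),
            None => 0,
        };
        (harvested, embedded)
    }
}

/// Toy embedder used only to make the sketch runnable.
struct LengthEmbedder;

impl Embedder for LengthEmbedder {
    fn embed(&self, text: &str) -> Vec<f32> {
        vec![text.len() as f32]
    }
}

fn main() {
    // Metadata-only harvest: no embedder configured.
    let mut pipeline: HarvestPipeline<LengthEmbedder> = HarvestPipeline {
        harvest: HarvestService::default(),
        embedding: None,
    };
    println!("{:?}", pipeline.run(&[("a", "Air quality"), ("b", "Bus stops")]));

    // Later backfill, independent of harvesting.
    let service = EmbeddingService { embedder: LengthEmbedder };
    let backfilled = service.embed_pending(&mut pipeline.harvest.datasets);
    println!("backfilled {backfilled} datasets");
    for dataset in &pipeline.harvest.datasets {
        println!("{} has embedding: {}", dataset.id, dataset.embedding.is_some());
    }
}
```

Because the embedding stage only looks for missing embeddings, the same call serves both the combined flow and a later backfill after a metadata-only harvest.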

REST API

Start the server:

cargo run --bin ceres-server

Available endpoints:

GET /api/v1/health
GET /api/v1/stats
GET /api/v1/search
GET /api/v1/portals
GET /api/v1/portals/{name}/stats
GET /api/v1/harvest/status
GET /api/v1/datasets/{id}
POST /api/v1/portals/{name}/harvest
POST /api/v1/harvest
GET /api/v1/export
GET /swagger-ui

Set CERES_ADMIN_TOKEN to enable protected write endpoints.

Website Docs

The website lives in website/ and documents the same harvest-first model:

harvesting architecture
optional embeddings and costs
contributing and security notes

Related Projects

Ceres-Claude-Skill for Claude Code and Claude custom skill support
AndreaBozzo/ceres-open-data-index for published dataset snapshots

Version

Current version: 0.4.0

Contributing

See website/src/content/docs/CONTRIBUTING.md and the crate-level docs for development setup, tests, and release workflow.

License

Apache-2.0. See LICENSE.
