GitHub - SocialGouv/questions-ecrites

QE Expert Assignment Toolkit

This repository ingests job descriptions into a Qdrant vector store and uses retrieval augmented reranking to match unanswered questions with the people most likely to have the right expertise.

Key components

scripts/ingest_job_descriptions.py — parses job description files, splits them into semantically meaningful chunks, embeds each chunk with Socle IA, and upserts the vectors into Qdrant.
scripts/assign_questions_to_expert.py — embeds unanswered questions, retrieves the most relevant job chunks from Qdrant, reranks them with Albert, and outputs suggested expert matches.

Local stack & database setup

# start Qdrant + Postgres locally
docker compose up qdrant postgres -d

# install deps (once)
poetry install

# run database migrations
poetry run alembic upgrade head

By default the Postgres service exposes postgresql://qe:qe@localhost:5433/qe. You can override this by setting the standard PGHOST, PGPORT, PGUSER, PGPASSWORD, and PGDATABASE environment variables.

scripts/ingest_job_descriptions.py now stores the chunk cache and ingest manifest in Postgres tables managed via Alembic migrations.

Environment variables

Variable	Required	Used by	Purpose / default
`SOCLE_IA_API_KEY`	✅	ingestion, assignment, test scripts	Socle IA API key for embeddings + LLM chunking.
`ALBERT_API_KEY`	✅ (assignment)	assignment	Albert rerank API key.
`LLM_BASE_URL`	✅ (ingestion)	ingestion	Base URL for Socle IA API (embeddings + chat). No default in code.
`LLM_MODEL`	✅ (ingestion)	ingestion	Model name for LLM chunking. No default in code.
`EMBEDDING_MODEL`	✅ (ingestion)	ingestion	Embedding model name. No default in code.
`PGHOST`	❌	ingestion	Postgres host (default: `localhost`).
`PGPORT`	❌	ingestion	Postgres port (default: `5433`).
`PGUSER`	❌	ingestion	Postgres user (default: `qe`).
`PGPASSWORD`	❌	ingestion	Postgres password (default: `qe`).
`PGDATABASE`	❌	ingestion	Postgres database (default: `qe`).

Chunk-aware ingestion

Recent improvements keep fine-grained context by chunking each job description before embedding:

Section detection: headings (uppercase lines, numbered sections, or lines ending with :) start new sections. When no headings are present, the whole document becomes a single section.
Size-controlled chunks: sections are further split into ~1,800-character windows with a minimum chunk size of 350 characters and 200-character overlaps to avoid losing context.
Rich metadata: every chunk stores job title, section title, chunk indices, a short preview, and the document hash so downstream consumers can display precise snippets and keep only changed chunks up to date.
Manifest-driven idempotence: manifest rows in Postgres track the last ingested hash for each file. Unchanged files are skipped automatically, and removed files trigger deletion of their Qdrant points.

Running the ingestion script

python scripts/ingest_job_descriptions.py \\
  --input-dir data/job_descriptions \\
  --collection job_descriptions \\
  --qdrant-url http://localhost:6333

Supported file types: .txt, .pdf, .doc, .docx.

The default chunking strategy uses the Socle chat completions endpoint at https://pliage-prod.socle-ia.data-ia.prod.atlas.fabrique.social.gouv.fr/api/v1/chat/completions to turn job descriptions into single-sentence responsibilities. Override this target with --llm-base-url if you have another Socle cluster, or disable the LLM step entirely with --chunking-strategy heuristic. A 405 response typically means the host you pointed to doesn’t expose /api/v1/chat/completions—pass the correct base URL or fall back to the heuristic chunker to keep ingestion running. Supported file types: .txt, .pdf, .doc, .docx.

Assigning questions to experts

export SOCLE_IA_API_KEY=...
export ALBERT_API_KEY=...
python scripts/assign_questions_to_expert.py \
  --input-dir data/qe_no_answers \
  --collection job_descriptions \
  --top-k 30 \
  --top-n 5 \
  --max-chunks-per-job 3

The assignment process now deduplicates rerank results per job so you see at most the --max-chunks-per-job highest scoring chunks for each expert. The JSON output stored at data/assignments.json includes chunk previews, section names, and rerank positions to make manual review easier.

Suggested workflow

Keep job descriptions up to date in data/job_descriptions/ and re-run ingestion when files change.
Drop unanswered questions (one per file) into data/qe_no_answers/.
Run the assignment script and inspect data/assignments.json for the suggested experts and sections.
Iterate by reviewing chunk previews to confirm that the right expertise areas are being surfaced. Adjust chunk sizes or retrieval parameters as needed.

Resetting Qdrant state

If you want to wipe the job-description collection and start fresh:

python scripts/reset_dbs.py \
  --collection job_descriptions \
  --qdrant-url http://localhost:6333 \
  --input-dir data/job_descriptions

This deletes the collection in Qdrant (ignoring the request if it already disappeared) and removes the .ingest_manifest.json file so the next ingestion run reprocesses every document from scratch.

Troubleshooting

Ensure Qdrant is reachable at the URL you pass in --qdrant-url.
If embeddings fail, confirm the Socle IA API key has access to the selected model.
If LLM chunking fails with 405 Method Not Allowed, supply a working Socle chat base via --llm-base-url or rerun with --chunking-strategy heuristic.
When reranking returns no results, double-check the Albert API key and model name.

Feel free to tailor chunk sizes, overlaps, or rerank limits via the respective CLI flags to better match your dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.cline		.cline
.github/workflows		.github/workflows
.roo		.roo
alembic		alembic
qe		qe
scripts		scripts
.clinerules		.clinerules
.env.example		.env.example
.gitignore		.gitignore
.mcp.json		.mcp.json
.pre-commit-config.yaml		.pre-commit-config.yaml
.secrets.baseline		.secrets.baseline
CLAUDE.md		CLAUDE.md
Makefile		Makefile
README.md		README.md
alembic.ini		alembic.ini
docker-compose.yml		docker-compose.yml
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

QE Expert Assignment Toolkit

Key components

Local stack & database setup

Environment variables

Chunk-aware ingestion

Running the ingestion script

Assigning questions to experts

Suggested workflow

Resetting Qdrant state

Troubleshooting

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

QE Expert Assignment Toolkit

Key components

Local stack & database setup

Environment variables

Chunk-aware ingestion

Running the ingestion script

Assigning questions to experts

Suggested workflow

Resetting Qdrant state

Troubleshooting

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages