🚀 Getting Started | 🔧 Usage | 🎯 Benchmarks | 🧪 Baselines | 📁 Project Structure | 🤖 Todo
DAR Library is a fast, modular open-source library for Multi-Agent Debate (MAD) research. Powered by vLLM, DAR Library delivers significantly faster inference than existing MAD frameworks, making large-scale debate experiments accessible and reproducible. 📦 DAR Library ships with a comprehensive collection of SOTA MAD baselines out of the box, including our own Diversity-Aware Retention (DAR) method, so you can benchmark, extend, and build on the latest research in minutes.

✨ If you find DAR Library helpful, please share your feedback, cite our work, and give it a ⭐. Your support means a lot!
Why DAR Library?
- DAR Library is up to 100× faster than existing MAD libraries thanks to native vLLM integration with batched inference. Run more experiments, wait less.
- DAR Library ships with SOTA baselines including uncertainty-aware prompting, voting mechanisms, uncertainty-based filtering, personas, Society of Minds, and topology variants (sparse, centralized), all in one place.
- DAR Library is modular and extensible. Adding a new model, benchmark, or filtering strategy takes only a few lines of code.
About Our DAR Method:
- DAR's core idea is simple: rather than forwarding all agent messages each round, only the most diverse and informative messages are retained. To this end, DAR incorporates three controlling mechanisms (see the example command after this list):
- 🔵 Uncertainty Prompt: agents signal their confidence alongside their answers
- 🗳️ Vote Prompt: agents vote across candidate answers to surface consensus
- ✂️ Critical Filtering: only messages that challenge or diversify the consensus are retained for the next round
- DAR delivers cost savings by streamlining agent communication, reducing message volume by up to 30–40%.
- DAR scales seamlessly as the number of agents grows, driving significant gains in performance (see example below).
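All three mechanisms map directly to CLI flags. As a concrete example (the same command shown again in the Baselines section), the following run enables uncertainty prompting, voting, and critical filtering together:

```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_critical
```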
📄 For more details, check out our paper. Please feel free to open issues or pull requests. We're constantly working to improve and expand the library.
> [!IMPORTANT]
> If you find this repository helpful for your work, please consider citing as follows:
```bibtex
@article{nguyen2026hear,
  title={Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention},
  author={Nguyen, Manh and Nguyen, Anh and Nguyen, Dung and Venkatesh, Svetha and Le, Hung},
  journal={arXiv preprint arXiv:2603.20640},
  year={2026}
}
```

## 🚀 Getting Started

```bash
git clone https://github.com/DA2I2-SLM/DAR.git
cd DAR
```

Python 3.10 or higher is recommended.
```bash
conda create -n dar python=3.10.16 -y
conda activate dar
pip install -r requirements.txt
```

🔑 Using gated models (Llama-3.1 or Gemma)? Make sure you have requested access on Hugging Face and export your token before running any scripts:

```bash
export HF_TOKEN="your_huggingface_token_here"
```

For Hugging Face inference mode (no vLLM, slow), also place your token in a file named `token` (single line, no quotes).
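For example, assuming `HF_TOKEN` is already exported and that the scripts look for the file in the repository root (an assumption; adjust the path if needed), you can create the token file like this:

```bash
# Write the raw token (single line, no quotes) into a file named `token`
printf '%s' "$HF_TOKEN" > token
```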
## 🔧 Usage

Run a basic MAD experiment on the Arithmetics dataset:
```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2
```

If set up properly on an H100 GPU, the experiment should finish in less than 1 minute!

Topology flags (append to any command):
- Default: fully connected
- `--sparse`: sparse graph topology
- `--centralized`: centralized (hub-and-spoke) topology
Persona flag:
- `--multi_persona`: enables heterogeneous agent personas
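For illustration, here is the basic command with a sparse topology and personas enabled; this particular flag combination is our own example rather than one of the provided scripts:

```bash
# Basic MAD run with a sparse communication graph and heterogeneous personas
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --sparse --multi_persona
```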
Inference backend:
- Default: vLLM (fast, batched; recommended for most benchmarks)
- `--use_hf_inference --hf_batch_size 16`: Hugging Face Transformers backend (recommended for math datasets if using an A100/H100 GPU; adjust batch size based on available memory)
⚠️ Note: vLLM and HF backends may produce different results due to mismatches in sampling behavior and log-probability computation.
Implementation details for the filtering algorithms can be found in `src/dev.py`.
Run a quick end-to-end validation of the full DAR pipeline with Qwen2.5-1.5B:
```bash
bash scripts/validate.sh
```

⚠️ Note: due to vLLM sampling non-determinism, results may vary slightly across runs. We report averages over multiple seeds with standard deviations in the paper.
## 🎯 Benchmarks

Models: Qwen2.5-1.5B, Qwen2.5-3B, Llama3.1-8B, Falcon3-7B

Datasets:
- Math: Arithmetics, GSM8K
- QA: MMLU (Formal Logic, Professional Medicine), HH-RLHF, CommonSenseQA
Datasets are handled automatically via `data/data_utils.py`; just pass the dataset name to `--data`.
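For example, to run the same basic experiment on GSM8K (assuming the dataset name follows the same lowercase convention as `arithmetics`):

```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data gsm8k --data_size 100 --debate_rounds 2
```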
Complete run commands for every benchmark are provided in `scripts/`. Below are representative examples.
Arithmetics, Qwen2.5-1.5B, DAR (vLLM backend):
```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_critical
```

Arithmetics, Qwen2.5-1.5B, DAR (HF backend):
```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_critical \
    --use_hf_inference --hf_batch_size 16
```

⚠️ Note: We observe that Hugging Face inference helps MAD perform better on mathematical datasets on certain GPUs.
After each run:
- Accuracy metrics are appended to `out/<dataset>_vllm_batch_logs.tsv`
- Full debate history (agent messages, ANLL uncertainty scores, final answers) is serialized to `out/history/<experiment_name>.jsonl`
- Use `--debug` to prepend `DEBUG_` to the output filename for quick inspection runs
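To take a quick look at the outputs, you can tail the TSV summary or pretty-print a single history record; the file names below are placeholders, so substitute your own dataset and experiment name:

```bash
# Last few accuracy rows appended for arithmetics runs
tail -n 5 out/arithmetics_vllm_batch_logs.tsv

# Pretty-print the first serialized debate record of an experiment
head -n 1 out/history/<experiment_name>.jsonl | python -m json.tool
```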
## 🧪 Baselines

DAR ships with a full set of MAD baselines so you can fairly compare against prior work out of the box. Please refer to our paper for more information on the baselines.
Basic MAD (no filtering):
```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2
```

Top-K Uncertainty Filtering (retain the 50% most certain):

```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 --top_k_uncertainty 0.5
```

Uncertainty Prompt only:

```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 --uncertainty_prompt True
```

Vote Prompt only:

```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 --vote_prompt True
```

vLLM backend (default, fastest):
```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_critical
```

HF backend:
```bash
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_critical \
    --use_hf_inference --hf_batch_size 16
```

With a separate LLM Filtering Moderator:
```bash
python src/main.py --model qwen2.5-3b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_critical --separate_moderator qwen2.5-1.5b
```

Ablation variants (retaining criteria):
```bash
# Retain most certain messages
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_certain

# Retain supporting messages
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_support

# Retain non-voting messages
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
    --uncertainty_prompt True --vote_prompt True --m_role filter_nonvote
```

## 📁 Project Structure

```
DAR/
├── data/            # Raw benchmark datasets (auto-downloaded and processed)
├── out/             # TSV metric summaries + JSONL full debate history logs
├── result/          # Runtime operation and token logs
├── scripts/         # Bash scripts for large-scale benchmark experiments
└── src/
    ├── main.py      # Main entry point for the Multi-Agent Debate pipeline
    ├── dev.py       # Filtering algorithms (filter_critical, filter_support, etc.)
    ├── evaluator.py # Metric evaluation and regex parsing of model outputs
    └── model/       # vLLM initialization and sampling configurations
```
## 🤖 Todo

- [x] Core MAD pipeline with vLLM backend
- [x] DAR method (Uncertainty + Voting + Critical Filtering)
- [x] Math benchmarks (Arithmetics, GSM8K)
- [x] QA benchmarks (MMLU, HH-RLHF, CommonSenseQA)
- [ ] AIME24 / AIME25 support
- [ ] Test on other LLMs
Any contribution you can make is welcome. Feel free to open issues, suggest features, or submit pull requests!
This work is a team effort: @manhitv leads the development, @tienanh28122000 drives inference acceleration and experiments, and @thaihungle supervises the project.

