DAR Library: Diversity-Aware Retention for Multi-Agent Debate

Fast, modular, and diversity-aware: the open-source library for Multi-Agent Debate research.


🚀 Getting Started | 🔧 Usage | 🎯 Benchmarks | 🧠 Baselines | 📂 Project Structure | 🤝 Todo

DAR Library is a fast, modular open-source library for Multi-Agent Debate (MAD) research. Powered by vLLM, it delivers significantly faster inference than existing MAD frameworks, making large-scale debate experiments accessible and reproducible. 🎉

DAR Library ships with a comprehensive collection of SOTA MAD baselines out of the box, including our own Diversity-Aware Retention (DAR) method, so you can benchmark, extend, and build on the latest research in minutes. ✨

If you find DAR Library helpful, please share your feedback, cite our work, and give it a ⭐. Your support means a lot!

Why DAR Library?

  • DAR Library is up to 100× faster than existing MAD libraries thanks to native vLLM integration with batched inference. Run more experiments, wait less.
  • DAR Library ships with SOTA baselines including uncertainty-aware prompting, voting mechanisms, uncertainty-based filtering, persona, society of minds, and topology variants (sparse, centralized), all in one place.
  • DAR Library is modular and extensible. Adding a new model, benchmark, or filtering strategy takes only a few lines of code.

About Our DAR Method:

  • DAR's core idea is simple: rather than forwarding all agent messages each round, only the most diverse and informative messages are retained. To this end, DAR combines three control mechanisms:
    • 🔵 Uncertainty Prompt: agents signal their confidence alongside their answers
    • 🗳️ Vote Prompt: agents vote across candidate answers to surface consensus
    • ✂️ Critical Filtering: only messages that challenge or diversify the consensus are retained for the next round
  • DAR delivers cost savings by streamlining agent communication, reducing message volume by 30–40%.
  • DAR scales seamlessly as the number of agents grows, driving significant performance gains (see our paper for an example).
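
The retention idea above can be sketched in a few lines. This is an illustrative simplification, not the library's actual implementation (the real filtering lives in src/dev.py), and the Message fields and helper names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Message:
    agent_id: int
    answer: str        # the agent's candidate answer
    confidence: float  # self-reported confidence from the uncertainty prompt
    vote: str          # the answer this agent voted for

def consensus(messages):
    """Majority answer across agent votes (the vote-prompt step)."""
    tally = {}
    for m in messages:
        tally[m.vote] = tally.get(m.vote, 0) + 1
    return max(tally, key=tally.get)

def filter_critical(messages):
    """Retain only messages that challenge or diversify the consensus:
    those whose answer disagrees with the majority vote."""
    majority = consensus(messages)
    return [m for m in messages if m.answer != majority]

msgs = [
    Message(0, "42", 0.9, "42"),
    Message(1, "42", 0.8, "42"),
    Message(2, "41", 0.6, "42"),
]
print([m.agent_id for m in filter_critical(msgs)])  # only the dissenting agent: [2]
```

The key property is that agreeing messages are dropped, so later rounds spend tokens only on the viewpoints that might change the outcome.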

📜 For more details, check out our paper. Please feel free to open issues or pull requests. We're constantly working to improve and expand the library.

Important

If you find this repository helpful for your work, please consider citing as follows:

@article{nguyen2026hear,
  title={Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention},
  author={Nguyen, Manh and Nguyen, Anh and Nguyen, Dung and Venkatesh, Svetha and Le, Hung},
  journal={arXiv preprint arXiv:2603.20640},
  year={2026}
}

🚀 Installation and Quick Start

⏬ Cloning the Repository

git clone https://github.com/DA2I2-SLM/DAR.git
cd DAR

💿 Installing Dependencies

Python 3.10 or higher is recommended.

conda create -n dar python=3.10.16 -y
conda activate dar
pip install -r requirements.txt

📌 Using gated models (Llama-3.1 or Gemma)? Make sure you have requested access on Hugging Face and export your token before running any scripts:

export HF_TOKEN="your_huggingface_token_here"

For Hugging Face inference mode (no vLLM, slow), also place your token in a file named token (single line, no quotes).


🔧 Usage

Run a basic MAD experiment on the Arithmetics dataset:

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2

On a properly configured H100 GPU, this run should finish in under a minute. (A demo animation is available in the repository.)

Topology flags (append to any command):

  • Default: fully-connected
  • --sparse: sparse graph topology
  • --centralized: centralized (hub-and-spoke) topology
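
As a mental model for these flags, the topology decides which agents receive each other's messages each round. A hypothetical sketch (the library's actual wiring may differ):

```python
def build_topology(num_agents, mode="full"):
    """Return adjacency sets: adj[i] = agents whose messages agent i receives.

    "full":        every agent hears every other agent (the default)
    "sparse":      each agent hears only its ring neighbours (one sparse choice)
    "centralized": a hub agent (0) hears everyone; spokes hear only the hub
    """
    agents = range(num_agents)
    if mode == "full":
        return {i: {j for j in agents if j != i} for i in agents}
    if mode == "sparse":
        return {i: {(i - 1) % num_agents, (i + 1) % num_agents} for i in agents}
    if mode == "centralized":
        adj = {0: {j for j in agents if j != 0}}
        adj.update({i: {0} for i in agents if i != 0})
        return adj
    raise ValueError(f"unknown topology: {mode}")

print(build_topology(4, "centralized"))
```

Sparser topologies trade some information flow for fewer forwarded messages, which is the same cost lever DAR's filtering pulls.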

Persona flag:

  • --multi_persona: enables heterogeneous agent personas

Inference backend:

  • Default: vLLM (fast, batched; recommended for most benchmarks)
  • --use_hf_inference --hf_batch_size 16: Hugging Face Transformers backend (recommended for math datasets on A100/H100 GPUs; adjust batch size based on available memory)

✍️ Note: vLLM and HF backends may produce different results due to mismatches in sampling behavior and log-probability computation.

Implementation details for the filtering algorithms can be found in src/dev.py.

⚡ Quick Validation

Run a quick end-to-end validation of the full DAR pipeline with Qwen2.5-1.5B:

bash scripts/validate.sh

✍️ Note: due to vLLM sampling non-determinism, results may vary slightly across runs. We report averages over multiple seeds with standard deviations in the paper.
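
Aggregating across seeds, as the note describes, takes only a few lines of standard-library Python (the accuracy numbers below are made up):

```python
import statistics

def summarize(accuracies):
    """Report mean and sample standard deviation across seeds."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# Per-seed accuracies from three runs of the same configuration (illustrative).
mean, std = summarize([0.71, 0.69, 0.73])
print(f"{mean:.3f} ± {std:.3f}")  # 0.710 ± 0.020
```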


🎯 Benchmarks

☝️ Tested Models

  • Qwen2.5-1.5B, Qwen2.5-3B
  • Llama3.1-8B
  • Falcon3-7B

✌️ Supported Benchmarks

  • Math: Arithmetics, GSM8K
  • QA: MMLU (Formal Logic, Professional Medicine), HH-RLHF, CommonSenseQA

Datasets are handled automatically via data/data_utils.py; just pass the dataset name to --data.
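
A common pattern behind this kind of dispatch is a name-to-loader registry. The sketch below is hypothetical (the loader and record fields are invented), not the actual contents of data/data_utils.py:

```python
# Hypothetical dataset registry: map the --data name to a loader function.
def load_arithmetics(n):
    # Placeholder: a real loader would download and preprocess the benchmark.
    return [{"question": f"What is {i} + {i}?", "answer": str(2 * i)} for i in range(n)]

DATASETS = {
    "arithmetics": load_arithmetics,
    # "gsm8k": load_gsm8k, "mmlu": load_mmlu, ...
}

def get_dataset(name, data_size):
    try:
        return DATASETS[name](data_size)
    except KeyError:
        raise ValueError(f"unknown dataset: {name}") from None

sample = get_dataset("arithmetics", 3)
print(sample[2])  # {'question': 'What is 2 + 2?', 'answer': '4'}
```

Adding a new benchmark then amounts to writing one loader and registering it under a name.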

🔬 Running Full Benchmark Experiments

Complete run commands for every benchmark are provided in scripts/. Below are representative examples.

Arithmetics, Qwen2.5-1.5B, DAR (vLLM backend):

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_critical

Arithmetics, Qwen2.5-1.5B, DAR (HF backend):

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_critical \
  --use_hf_inference --hf_batch_size 16

✍️ Note: We observe that Hugging Face inference helps MAD perform better on mathematical datasets on certain GPUs.

📒 Results and Logs

After each run:

  • Accuracy metrics are appended to out/<dataset>_vllm_batch_logs.tsv
  • Full debate history (agent messages, ANLL uncertainty scores, final answers) is serialized to out/history/<experiment_name>.jsonl
  • Use --debug to prepend DEBUG_ to the output filename for quick inspection runs
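
Because the history is plain JSONL, post-run analysis takes only a few lines. The record fields below are invented for illustration; adapt the keys to whatever your runs actually log:

```python
import json

def load_history(path):
    """Read one JSON record per line from a debate history file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Write a tiny synthetic history, then read it back.
with open("demo_history.jsonl", "w") as f:
    f.write(json.dumps({"round": 0, "agent": 1, "answer": "42"}) + "\n")
    f.write(json.dumps({"round": 1, "agent": 1, "answer": "41"}) + "\n")

records = load_history("demo_history.jsonl")
print(len(records), records[-1]["answer"])  # 2 41
```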

🧠 Baselines

DAR ships with a full set of MAD baselines so you can fairly compare against prior work out of the box. Please refer to our paper for more information on the baselines.

☝️ Standard Baselines

Basic MAD (no filtering):

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2

Top-K Uncertainty Filtering (retain 50% most certain):

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 --top_k_uncertainty 0.5
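
Conceptually, this baseline ranks messages by self-reported certainty and keeps the top fraction. An illustrative sketch (field names hypothetical; the actual implementation is in src/dev.py):

```python
def top_k_uncertainty(messages, keep_frac=0.5):
    """Retain the keep_frac most certain messages (highest confidence)."""
    k = max(1, int(len(messages) * keep_frac))
    ranked = sorted(messages, key=lambda m: m["confidence"], reverse=True)
    return ranked[:k]

msgs = [
    {"agent": 0, "confidence": 0.9},
    {"agent": 1, "confidence": 0.4},
    {"agent": 2, "confidence": 0.7},
    {"agent": 3, "confidence": 0.2},
]
print([m["agent"] for m in top_k_uncertainty(msgs)])  # [0, 2]
```

Note the contrast with DAR's critical filtering: this baseline keeps the most confident messages, whereas DAR keeps the most diverse ones.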

Uncertainty Prompt only:

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 --uncertainty_prompt True

Vote Prompt only:

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 --vote_prompt True

✌️ Our Method: DAR = Uncertainty Prompt + Vote Prompt + Critical Filtering

vLLM backend (default, fastest):

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_critical

HF backend:

python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_critical \
  --use_hf_inference --hf_batch_size 16

With a separate LLM Filtering Moderator:

python src/main.py --model qwen2.5-3b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_critical --separate_moderator qwen2.5-1.5b

Ablation variants (retention criteria):

# Retain most certain messages
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_certain

# Retain supporting messages
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_support

# Retain non-voting messages
python src/main.py --model qwen2.5-1.5b --num_agents 4 --data arithmetics --data_size 100 --debate_rounds 2 \
  --uncertainty_prompt True --vote_prompt True --m_role filter_nonvote

📂 Project Structure

DAR/
├── data/              # Raw benchmark datasets (auto-downloaded and processed)
├── out/               # TSV metric summaries + JSONL full debate history logs
├── result/            # Runtime operation and token logs
├── scripts/           # Bash scripts for large-scale benchmark experiments
└── src/
    ├── main.py        # Main entry point for the Multi-Agent Debate pipeline
    ├── dev.py         # Filtering algorithms (filter_critical, filter_support, etc.)
    ├── evaluator.py   # Metric evaluation and regex parsing of model outputs
    └── model/         # vLLM initialization and sampling configurations

🤝 Things to Do

  • Core MAD pipeline with vLLM backend
  • DAR method (Uncertainty + Voting + Critical Filtering)
  • Math benchmarks (Arithmetics, GSM8K)
  • QA benchmarks (MMLU, HH-RLHF, CommonSenseQA)
  • AIME24 / AIME25 support
  • Test on other LLMs

Any contribution you can make is welcome. Feel free to open issues, suggest features, or submit pull requests!


Acknowledgements

About us

This work is a team effort: @manhitv leads the development, @tienanh28122000 drives inference acceleration and experiments, and @thaihungle supervises the project.
