mbpp

Here are 9 public repositories matching this topic...

abhaymundhara / llm-benchmark-suite

Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.

python benchmark evaluation gemini openai code-generation claude streamlit humaneval llm ollama swe-bench mbpp bigcodebench

Updated Apr 23, 2026
Python

Miaoge-Ge / llm-eval-framework

A lightweight, configuration-driven evaluation framework for LLM code generation & reasoning tasks (MBPP, HumanEval, GSM8K). Supports multi-provider (DeepSeek, OpenAI, ZhipuAI) and concurrent execution.

benchmark evaluation humaneval llm gsm8k mbpp

Updated May 27, 2026
Python

OpenMLRL / LLM_Collab_Code_Generation

LLM Collaboration for Code Generation

code-generation multi-agent-systems multi-agent-reinforcement-learning humaneval large-language-models code-agent mbpp comlrl openmlrl coophumaneval

Updated Feb 17, 2026
Python

jumincho / workflow-as-expert-router

Routing down to the workflow level (dormant project): routing workflows, not just models, as experts (WaE vs MasRouter on MBPP/HumanEval). Systems-pattern gain reproduced; dynamic-routing headline unresolved.

nlp workflow evaluation research-archive humaneval llm vllm llm-router expert-routing mbpp

Updated May 28, 2026
Python

Shreyash-Gaur / TensorFlow_Python_Code_Generation

Fine-tuning CodeT5 for Python code generation on the MBPP dataset. Features custom TensorFlow training loops, mixed precision, XLA optimization, and distributed multi-GPU strategies.

deep-learning tensorflow transformer code-generation distributed-training mixed-precision huggingface nl2code text-to-code llm generative-ai codet5 mbpp

Updated Mar 19, 2025
Jupyter Notebook

jcartu / qwen36-27b-blackwell-stress-validation

Stress-validation of Qwen3.6-27B inference configurations on dual RTX PRO 6000 Blackwell. 5 configs x 4 phases (gates, throughput matrix, HumanEval, MBPP) = 2,105 hard coding problems, zero crashes. Headline: FP8+MTP=3 wins HumanEval (79.3%), BF16+DFlash wins MBPP (89.5%). MTP=5 dominated on correctness despite faster raw tok/s.

benchmark inference blackwell humaneval vllm qwen speculative-decoding qwen3 mbpp rtx-pro-6000

Updated May 7, 2026
Python

jcartu / llm-stress-harness

Diagnostic toolkit for self-hosted LLM inference: failure-taxonomic stress harness + 4-phase orchestrator + parametric vLLM launchers

python benchmarking inference stress-testing humaneval llm vllm speculative-decoding sglang mbpp

Updated May 11, 2026
Shell

jcartu / qwen-bench-2026-05-11-v2-followup

Study #4: FP8+MTP{3,5} speed on repne/vllm:v2 + max_tokens=8192 quality re-runs for BF16+DFlash n=8 and FP8+MTP=3. Follow-up to studies #2 and #3.

benchmark inference mtp blackwell humaneval vllm speculative-decoding qwen3 mbpp dflash qwen-bench

Updated May 11, 2026
Python

scouzi1966 / qwen-humaneval

🧪 Automated LLM coding benchmarks with Ollama - HumanEval & MBPP evaluation suite with safe execution, comprehensive logging, and detailed analysis tools

python benchmarking machine-learning evaluation coding humaneval llm ollama qwen mbpp

Updated Aug 1, 2025
Python
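
Several of the repositories above describe executing generated code against MBPP test cases and reporting pass rates. As a rough illustration only, and not the code of any listed project, the sketch below runs a candidate solution together with MBPP-style `assert` statements in a separate Python process with a timeout. The names `run_candidate` and `pass_rate` and the 5-second default are assumptions made for this example. A subprocess with a timeout only guards against hangs and crashes; it is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, tests: list[str], timeout_s: float = 5.0) -> bool:
    """Return True if `code` passes every assert in `tests` within the timeout.

    Hypothetical helper: the candidate and its tests are written to a
    temporary script and executed in a child interpreter.
    """
    program = code + "\n\n" + "\n".join(tests) + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops or very slow solutions count as failures.
            return False
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of problems solved; 0.0 for an empty run (assumed convention)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A harness along these lines would call `run_candidate` once per problem with the model's output and that problem's assert list, then pass the collected booleans to `pass_rate`. Running untrusted model output safely in practice calls for stronger isolation, such as a container or a restricted user account.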
