Script to quickly extract scores by Pd172944 · Pull Request #75 · skylight-org/sparse-attention-hub

Pd172944 · 2026-03-15T05:00:37Z

When running benchmark suites, this script makes it a lot easier to print out scores for data collection and averaging.

Usage:
python scripts/extract_benchmark_scores.py --directory /path/to/results/directory

sahiljoshi515 · 2026-03-16T01:03:24Z

sparse_attention_hub/adapters/huggingface.py

                if name in original_implementations:
                    module.config._attn_implementation = original_implementations[name]

+    def _postprocess_answer(self, answer: str, generation_kwargs: Dict[str, Any]) -> str:


Why do we need this?

sahiljoshi515 · 2026-03-16T01:03:50Z

@claude review

claude · 2026-03-16T01:04:06Z

Claude finished @sahiljoshi515's task in 59s —— View job

Code Review

Gather context
Read changed files
Provide review feedback

`scripts/extract_benchmark_scores.py` — Generally clean utility script

Minor issues:

Python 3.10+ type union syntax (float | None) — The get_score_from_metrics return type hint uses the X | Y syntax (line 17), which requires Python 3.10+. If the repo targets earlier Python versions, use Optional[float] from typing instead.
Silent error swallowing — The except (json.JSONDecodeError, OSError): continue block (lines 71–72) silently skips malformed/unreadable files with no warning. A print(f"Warning: skipping {metrics_path}: {e}", file=sys.stderr) would help users notice corrupted results files.
task_scores fallback is fragile — The fallback logic in get_score_from_metrics picks the first key from task_scores dict (line 31). While dict ordering is stable in Python 3.7+, this is an implicit assumption. If there are multiple tasks and the first doesn't reflect what users want, it will silently return the wrong score. A comment or explicit prioritization would clarify intent.
generation_kwargs unused — The parameter is accepted but unused in _postprocess_answer; see below.

`sparse_attention_hub/adapters/huggingface.py` — Dead code

Addressing the review comment: "Why do we need this?"

_postprocess_answer (lines 379–391) is never called anywhere in the codebase. _generate_response decodes and directly returns the answer on line 490 without invoking this method. The method also accepts generation_kwargs but doesn't use it.

This appears to be either:

Scaffolding intended for future use that was committed prematurely, or
A method that was meant to be wired into _generate_response but wasn't

If this is meant to strip leading whitespace from generated answers, it should be called at the end of _generate_response before the return answer on line 490. If it's just placeholder code, it should be removed to avoid confusion.

Fix this →

Summary: The script itself is functional and well-structured. The main concern is the unused _postprocess_answer method — it should either be wired in or removed.

sahiljoshi515 · 2026-03-16T01:14:33Z

Also fix lints.

Pd172944 · 2026-03-17T09:19:34Z

@sahiljoshi515 See lints/checks passed.

kumarkrishna · 2026-03-17T17:41:50Z

@claude review the PR in scope of the changes

claude · 2026-03-17T17:42:10Z

Claude finished @kumarkrishna's task in 1m 17s —— View job

Code Review

Gather context
Read changed files
Provide review feedback

`scripts/extract_benchmark_scores.py` — New utility script

Minor issues:

Python 3.10+ type union syntax — float | None (line 17) and list[tuple[str, float]] (line 61) require Python 3.10+. pyproject.toml now targets python_version = "3.12" for mypy, but if the runtime minimum is lower, use Optional[float] and List[Tuple[str, float]] from typing instead. This is consistent with the rest of the codebase which uses Optional[...]/Dict[...] everywhere.
Silent file errors — The except (json.JSONDecodeError, OSError): continue (lines 71–72) drops errors silently. A print(f"Warning: skipping {metrics_path}: {e}", file=sys.stderr) would help users diagnose corrupted results.
Fragile task_scores fallback — The fallback takes next(iter(task_scores.values())), which picks the first task arbitrarily. If the dict contains multiple tasks with different structures, this can silently return the wrong score. A comment explaining why the first is preferred would help.

`pyproject.toml` — `python_version` bumped to 3.12, `follow_imports = "skip"` added

Concern: Bumping mypy's python_version from 3.9 to 3.12 is a meaningful change that affects type-checking behavior (e.g., allows X | Y union syntax, list[...] as a generic, etc.). This appears to have been done to make mypy happy with the new script's type hints. If the project still needs to support Python 3.9/3.10/3.11 at runtime, this mypy setting is now misaligned with actual compatibility requirements. Consider clarifying the minimum supported Python version in pyproject.toml (e.g., requires-python = ">=3.12") or keeping python_version = "3.9" and using from __future__ import annotations in the new script.

follow_imports = "skip" suppresses mypy errors on third-party imports globally, which could mask real type issues. This is a blunt workaround — the previous ignore_missing_imports = true on a per-module basis was more targeted.

`sparse_attention_hub/adapters/huggingface.py` — Type annotation improvements

The changes add explicit type annotations to instance attributes (model_kwargs, tokenizer_kwargs, model_registry_path, allow_unregistered_models) and add runtime type guards when extracting from **kwargs. This is clean and improves mypy strictness compliance. The **kwargs: Any fix (replacing **kwargs: Dict[str, Any]) is correct — **kwargs annotates the value type, not a dict type.

No issues here.

`sparse_attention_hub/adapters/model_servers/huggingface.py` — GPU device handling simplification

The removed branch handled gpu_id values that already had the "cuda" prefix (e.g., "cuda:0"). Removing it means passing "cuda:0" as gpu_id will now produce "cuda:cuda:0", which would raise a RuntimeError from PyTorch.

Is this safe? Only if callers are guaranteed to pass an integer or a bare integer-as-string for gpu_id. Worth a quick check of all call sites to confirm gpu_id is always a plain integer.

Check this →

`sparse_attention_hub/sparse_attention/utils/mask.py` — Attribute initialization refactor

Moving the self.mask, self.indices, self.ptr, self.data assignments to the top of __init__ is a solid refactor — it ensures all attributes are always defined regardless of which branch runs, which is good practice for __init__ methods and makes mypy/IDEs happier. The change is correct and non-breaking.

The int(self.indices.numel()) == 0 cast (line 166) is a minor defensive fix — numel() already returns an int, but the explicit cast suppresses a mypy false positive. Fine as-is.

`tests/` — Compatibility shims for newer Transformers versions

test_double_sparsity_correctness.py: The rope_theta compatibility patch (if not hasattr(config, "rope_theta"): setattr(...)) is a clean cross-version guard. Good.

test_quest_correctness.py: The apply_rotary_pos_emb monkeypatch is more invasive — it patches a module-level function and uses request.addfinalizer to restore it. The implementation is correct and the patch is properly scoped to the fixture. One observation: the shim assumes unsqueeze_dim=1 is always the right default when position_ids is a tensor — this may not hold for all sequence configurations. The comment explains the upstream API change clearly, which is appreciated.

Summary

The core script (extract_benchmark_scores.py) is functional. The main follow-up items are:

GPU device string regression in huggingface.py model server — verify "cuda:X"-prefixed strings are never passed as gpu_id.
Python version alignment — the mypy python_version = "3.12" bump should be consistent with the runtime minimum, or the new script should use pre-3.10 type annotation syntax.
follow_imports = "skip" is broader than necessary and may hide real issues.

kumarkrishna

lgtm

script to quickly extract average scores

518c667

Pd172944 requested review from apd10, kumarkrishna and sahiljoshi515 March 15, 2026 05:02

fixing postprocess

0f664ce

sahiljoshi515 reviewed Mar 16, 2026

View reviewed changes

Pd172944 added 5 commits March 17, 2026 01:33

removing postprocess

e6b0046

lint fix 1

897f604

trying lint fix again

7230dbc

fixing lints 3

f79b7d8

fixing line black check

91ccbea

kumarkrishna approved these changes Mar 18, 2026

View reviewed changes

kumarkrishna merged commit f21cf16 into main Mar 18, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Script to quickly extract scores#75

Script to quickly extract scores#75
kumarkrishna merged 7 commits intomainfrom
skylight-runs

Pd172944 commented Mar 15, 2026

Uh oh!

sahiljoshi515 Mar 16, 2026

Uh oh!

sahiljoshi515 commented Mar 16, 2026

Uh oh!

claude bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

sahiljoshi515 commented Mar 16, 2026

Uh oh!

Pd172944 commented Mar 17, 2026

Uh oh!

kumarkrishna commented Mar 17, 2026

Uh oh!

claude bot commented Mar 17, 2026 •

edited

Loading

Uh oh!

kumarkrishna left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Pd172944 commented Mar 15, 2026

Uh oh!

sahiljoshi515 Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

sahiljoshi515 commented Mar 16, 2026

Uh oh!

claude bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

scripts/extract_benchmark_scores.py — Generally clean utility script

sparse_attention_hub/adapters/huggingface.py — Dead code

Uh oh!

sahiljoshi515 commented Mar 16, 2026

Uh oh!

Pd172944 commented Mar 17, 2026

Uh oh!

kumarkrishna commented Mar 17, 2026

Uh oh!

claude bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

scripts/extract_benchmark_scores.py — New utility script

pyproject.toml — python_version bumped to 3.12, follow_imports = "skip" added

sparse_attention_hub/adapters/huggingface.py — Type annotation improvements

sparse_attention_hub/adapters/model_servers/huggingface.py — GPU device handling simplification

sparse_attention_hub/sparse_attention/utils/mask.py — Attribute initialization refactor

tests/ — Compatibility shims for newer Transformers versions

Summary

Uh oh!

kumarkrishna left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude bot commented Mar 16, 2026 •

edited

Loading

`scripts/extract_benchmark_scores.py` — Generally clean utility script

`sparse_attention_hub/adapters/huggingface.py` — Dead code

claude bot commented Mar 17, 2026 •

edited

Loading

`scripts/extract_benchmark_scores.py` — New utility script

`pyproject.toml` — `python_version` bumped to 3.12, `follow_imports = "skip"` added

`sparse_attention_hub/adapters/huggingface.py` — Type annotation improvements

`sparse_attention_hub/adapters/model_servers/huggingface.py` — GPU device handling simplification

`sparse_attention_hub/sparse_attention/utils/mask.py` — Attribute initialization refactor

`tests/` — Compatibility shims for newer Transformers versions