Skip to content

Script to quickly extract scores#75

Merged
kumarkrishna merged 7 commits intomainfrom
skylight-runs
Mar 18, 2026
Merged

Script to quickly extract scores#75
kumarkrishna merged 7 commits intomainfrom
skylight-runs

Conversation

@Pd172944
Copy link
Collaborator

When running benchmark suites, this script makes it a lot easier to print out scores for data collection and averaging.

Usage:
python scripts/extract_benchmark_scores.py --directory /path/to/results/directory

if name in original_implementations:
module.config._attn_implementation = original_implementations[name]

def _postprocess_answer(self, answer: str, generation_kwargs: Dict[str, Any]) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

@sahiljoshi515
Copy link
Collaborator

@claude review

@claude
Copy link

claude bot commented Mar 16, 2026

Claude finished @sahiljoshi515's task in 59s —— View job


Code Review

  • Gather context
  • Read changed files
  • Provide review feedback

scripts/extract_benchmark_scores.py — Generally clean utility script

Minor issues:

  1. Python 3.10+ type union syntax (float | None) — The get_score_from_metrics return type hint uses the X | Y syntax (line 17), which requires Python 3.10+. If the repo targets earlier Python versions, use Optional[float] from typing instead.

  2. Silent error swallowing — The except (json.JSONDecodeError, OSError): continue block (lines 71–72) silently skips malformed/unreadable files with no warning. A print(f"Warning: skipping {metrics_path}: {e}", file=sys.stderr) would help users notice corrupted results files.

  3. task_scores fallback is fragile — The fallback logic in get_score_from_metrics picks the first key from task_scores dict (line 31). While dict ordering is stable in Python 3.7+, this is an implicit assumption. If there are multiple tasks and the first doesn't reflect what users want, it will silently return the wrong score. A comment or explicit prioritization would clarify intent.

  4. generation_kwargs unused — The parameter is accepted but unused in _postprocess_answer; see below.


sparse_attention_hub/adapters/huggingface.py — Dead code

Addressing the review comment: "Why do we need this?"

_postprocess_answer (lines 379–391) is never called anywhere in the codebase. _generate_response decodes and directly returns the answer on line 490 without invoking this method. The method also accepts generation_kwargs but doesn't use it.

This appears to be either:

  • Scaffolding intended for future use that was committed prematurely, or
  • A method that was meant to be wired into _generate_response but wasn't

If this is meant to strip leading whitespace from generated answers, it should be called at the end of _generate_response before the return answer on line 490. If it's just placeholder code, it should be removed to avoid confusion.

Fix this →


Summary: The script itself is functional and well-structured. The main concern is the unused _postprocess_answer method — it should either be wired in or removed.

@sahiljoshi515
Copy link
Collaborator

Also fix lints.

@Pd172944
Copy link
Collaborator Author

@sahiljoshi515 See lints/checks passed.

@kumarkrishna
Copy link
Collaborator

@claude review the PR in scope of the changes

@claude
Copy link

claude bot commented Mar 17, 2026

Claude finished @kumarkrishna's task in 1m 17s —— View job


Code Review

  • Gather context
  • Read changed files
  • Provide review feedback

scripts/extract_benchmark_scores.py — New utility script

Minor issues:

  1. Python 3.10+ type union syntaxfloat | None (line 17) and list[tuple[str, float]] (line 61) require Python 3.10+. pyproject.toml now targets python_version = "3.12" for mypy, but if the runtime minimum is lower, use Optional[float] and List[Tuple[str, float]] from typing instead. This is consistent with the rest of the codebase which uses Optional[...]/Dict[...] everywhere.

  2. Silent file errors — The except (json.JSONDecodeError, OSError): continue (lines 71–72) drops errors silently. A print(f"Warning: skipping {metrics_path}: {e}", file=sys.stderr) would help users diagnose corrupted results.

  3. Fragile task_scores fallback — The fallback takes next(iter(task_scores.values())), which picks the first task arbitrarily. If the dict contains multiple tasks with different structures, this can silently return the wrong score. A comment explaining why the first is preferred would help.


pyproject.tomlpython_version bumped to 3.12, follow_imports = "skip" added

Concern: Bumping mypy's python_version from 3.9 to 3.12 is a meaningful change that affects type-checking behavior (e.g., allows X | Y union syntax, list[...] as a generic, etc.). This appears to have been done to make mypy happy with the new script's type hints. If the project still needs to support Python 3.9/3.10/3.11 at runtime, this mypy setting is now misaligned with actual compatibility requirements. Consider clarifying the minimum supported Python version in pyproject.toml (e.g., requires-python = ">=3.12") or keeping python_version = "3.9" and using from __future__ import annotations in the new script.

follow_imports = "skip" suppresses mypy errors on third-party imports globally, which could mask real type issues. This is a blunt workaround — the previous ignore_missing_imports = true on a per-module basis was more targeted.


sparse_attention_hub/adapters/huggingface.py — Type annotation improvements

The changes add explicit type annotations to instance attributes (model_kwargs, tokenizer_kwargs, model_registry_path, allow_unregistered_models) and add runtime type guards when extracting from **kwargs. This is clean and improves mypy strictness compliance. The **kwargs: Any fix (replacing **kwargs: Dict[str, Any]) is correct — **kwargs annotates the value type, not a dict type.

No issues here.


sparse_attention_hub/adapters/model_servers/huggingface.py — GPU device handling simplification

The removed branch handled gpu_id values that already had the "cuda" prefix (e.g., "cuda:0"). Removing it means passing "cuda:0" as gpu_id will now produce "cuda:cuda:0", which would raise a RuntimeError from PyTorch.

Is this safe? Only if callers are guaranteed to pass an integer or a bare integer-as-string for gpu_id. Worth a quick check of all call sites to confirm gpu_id is always a plain integer.

Check this →


sparse_attention_hub/sparse_attention/utils/mask.py — Attribute initialization refactor

Moving the self.mask, self.indices, self.ptr, self.data assignments to the top of __init__ is a solid refactor — it ensures all attributes are always defined regardless of which branch runs, which is good practice for __init__ methods and makes mypy/IDEs happier. The change is correct and non-breaking.

The int(self.indices.numel()) == 0 cast (line 166) is a minor defensive fix — numel() already returns an int, but the explicit cast suppresses a mypy false positive. Fine as-is.


tests/ — Compatibility shims for newer Transformers versions

test_double_sparsity_correctness.py: The rope_theta compatibility patch (if not hasattr(config, "rope_theta"): setattr(...)) is a clean cross-version guard. Good.

test_quest_correctness.py: The apply_rotary_pos_emb monkeypatch is more invasive — it patches a module-level function and uses request.addfinalizer to restore it. The implementation is correct and the patch is properly scoped to the fixture. One observation: the shim assumes unsqueeze_dim=1 is always the right default when position_ids is a tensor — this may not hold for all sequence configurations. The comment explains the upstream API change clearly, which is appreciated.


Summary

The core script (extract_benchmark_scores.py) is functional. The main follow-up items are:

  1. GPU device string regression in huggingface.py model server — verify "cuda:X"-prefixed strings are never passed as gpu_id.
  2. Python version alignment — the mypy python_version = "3.12" bump should be consistent with the runtime minimum, or the new script should use pre-3.10 type annotation syntax.
  3. follow_imports = "skip" is broader than necessary and may hide real issues.

Copy link
Collaborator

@kumarkrishna kumarkrishna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kumarkrishna kumarkrishna merged commit f21cf16 into main Mar 18, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants