Create task for astroid PR-2496 by danielzayas · Pull Request #177 · SWE-bench/SWE-smith

danielzayas · 2025-11-29T04:23:41Z

Background

The SWE-bench/SWE-smith dataset on HuggingFace already contains 81 tasks from pylint-dev/astroid built via the PR Mirroring method. I wanted to try creating a new task for an existing repo, so I picked a PR that did not already have a task and followed the CONTRIBUTING.md instructions ➡️ proposed a new task.

RE "upload the zipped file as a new PR": I can't create a PR without a diff versus main, so I force-added the unzipped SWE-smith/logs/ files to the commit and uploaded the corresponding ZIP files (linked below). Given the .gitignore, we don't actually want to merge the SWE-smith/logs/ changes to main.

Proposed Change

Create a new task for astroid PR-2496 (pylint-dev/astroid#2496) via the PR Mirroring method, as per SWE-smith/docs/guides/create_instances.md.

https://github.com/SWE-bench/SWE-smith/blob/main/CONTRIBUTING.md#create-task-instances ➡️ "Zip this folder" ➡️ "bug_gen.zip" file:

bug_gen.zip

Test Plan

Ran swesmith/harness/valid.py ➡️ logs/run_validation/pylint-dev__astroid.b114f6b5/ output ➡️ "run_validation.zip" file output:

run_validation.zip

Ran swesmith/harness/eval.py ➡️ logs/run_evaluation/eval_astroid_2496/ output ➡️ "run_evaluation.zip" file output:

run_evaluation.zip

Ran swesmith/harness/gather.py ➡️ logs/task_insts/ ➡️ "task_insts.zip" file output:

task_insts.zip

Things I Learned (TIL)

(1) generate.py does not apply diff if > 1000 lines

SWE-smith/swesmith/bug_gen/mirror/generate.py couldn't apply the original diff directly because one of the changed files is several thousand lines long. The mirroring flow only attempts a direct git apply if every edited file satisfies the heuristics inside should_attempt_recovery, including "No changed file is >1000 lines". Falling back to the "recovery" path gave a faulty result because the LLM-produced revert missed logic ➡️ the fix was to raise that limit to 10,000 lines and re-run the pipeline. With the higher limit, the generator was allowed to reuse the actual PR diff, and the mirrored bug patch became bit-for-bit identical to the upstream fix ➡️ I contributed the change back to SWE-smith upstream via #178.
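To make the effect of the limit concrete, here is a minimal sketch of a per-file line-count check. It is not the actual code in should_attempt_recovery; the function name `within_line_limit`, the constants, and the example file paths are all hypothetical and only illustrate why a file of several thousand lines fails a 1,000-line limit but passes a 10,000-line one.

```python
# Hypothetical illustration only; NOT the real SWE-smith implementation.
MAX_FILE_LINES_BEFORE = 1_000   # limit quoted in the heuristic
MAX_FILE_LINES_AFTER = 10_000   # limit after the change in #178


def within_line_limit(file_line_counts: dict[str, int], max_lines: int) -> bool:
    """Return True when no changed file is longer than max_lines."""
    return all(count <= max_lines for count in file_line_counts.values())


# Made-up line counts: one changed file is several thousand lines long.
changed_files = {
    "path/to/large_module.py": 4_500,
    "path/to/small_module.py": 300,
}

direct_apply_before = within_line_limit(changed_files, MAX_FILE_LINES_BEFORE)
direct_apply_after = within_line_limit(changed_files, MAX_FILE_LINES_AFTER)
print(direct_apply_before, direct_apply_after)  # False True
```

Under the old limit the large file alone is enough to push the whole PR onto the recovery path; under the new limit the real diff can be used as-is.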

(2) valid.py runs the test suite from the pylint-dev/astroid upstream, not the mirrored and pruned repo at danielzayas/astroid

The Validation guide says, "The validation harness works in two steps. First, it 1) runs the original repository's test suite to get the passing statuses of the existing tests. Then, it 2) applies each candidate task instance to the repository and runs the test suite again". It seems like step (1) runs the upstream test suite, whereas step (2) runs the test suite after mirroring and pruning ➡️ some test files might not be present in the latter ➡️ the eval.py run was checking three PASS-to-PASS tests that never execute inside the container (tests/brain/test_nose.py::NoseBrainTest::test_nose_tools, tests/test_modutils.py::BackportStdlibNamesTest::test_import_error, tests/test_nodes.py::BoundMethodNodeTest::test_is_property). They appear in the valid.py logs from (1) but aren't present in the mirror snapshot used for (2), so the grading script always marks them as missing. I removed those three entries from the PASS_TO_PASS list in logs/task_insts/pylint-dev__astroid.b114f6b5.json. If those tests are later kept in the mirror instead of being pruned, they can be added back to the P2P test coverage.
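For anyone who hits the same problem, the manual cleanup could be scripted roughly as follows. This is a sketch under assumptions: it assumes the task instance file holds a JSON list of instance dicts, each with a PASS_TO_PASS list of test IDs, and the helper names `drop_missing_p2p` and `rewrite_task_file` are hypothetical (they are not part of SWE-smith).

```python
import json
from pathlib import Path

# The three tests that never run inside the container.
MISSING_TESTS = {
    "tests/brain/test_nose.py::NoseBrainTest::test_nose_tools",
    "tests/test_modutils.py::BackportStdlibNamesTest::test_import_error",
    "tests/test_nodes.py::BoundMethodNodeTest::test_is_property",
}


def drop_missing_p2p(instances: list[dict], missing: set[str]) -> list[dict]:
    """Return copies of the instances with `missing` removed from PASS_TO_PASS."""
    cleaned = []
    for inst in instances:
        updated = dict(inst)  # shallow copy so the input is left untouched
        updated["PASS_TO_PASS"] = [
            test for test in inst.get("PASS_TO_PASS", []) if test not in missing
        ]
        cleaned.append(updated)
    return cleaned


def rewrite_task_file(path: Path, missing: set[str]) -> None:
    """Load the task file (assumed: a JSON list of dicts), filter it, write it back."""
    instances = json.loads(path.read_text())
    path.write_text(json.dumps(drop_missing_p2p(instances, missing), indent=2))


# Example call (path taken from the description above):
# rewrite_task_file(
#     Path("logs/task_insts/pylint-dev__astroid.b114f6b5.json"), MISSING_TESTS
# )
```

Filtering on an explicit set of test IDs keeps the change auditable: only the listed tests are dropped, and every other PASS_TO_PASS entry is preserved in its original order.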

john-b-yang · 2025-12-10T23:07:12Z

Thanks so much @danielzayas! Just letting you know I saw this - will come back to this by the EOW.

danielzayas · 2026-01-06T19:53:58Z

I wanted to try creating a new task for an existing repo, so I picked a PR that did not already have a task and followed the CONTRIBUTING.md instructions ➡️ proposed a new task

I created this PR mainly to learn about the repo and workflow described in CONTRIBUTING.md ➡️ totally fine to close this without merging to main. I don't expect that particular task instance to be world-changing lol.

The primary positive outcomes of this seem to be:

Fixed "(1) generate.py does not apply diff if > 1000 lines" via "increase max file size for diff apply" (#178)
Identified issue "(2) valid.py runs the test suite from the pylint-dev/astroid upstream, not the mirrored and pruned repo at danielzayas/astroid" ➡️ I opened the issue "valid.py runs test suite from upstream repo, not the mirrored and pruned repo" (#194)

TLDR: we can merge if we want the new task and just close if not

And thanks again for the awesome open source project!!

Create task for astroid PR-2496

eb2806f

danielzayas marked this pull request as ready for review November 29, 2025 04:55

danielzayas mentioned this pull request Nov 29, 2025

increase max file size for diff apply #178

Merged

add repo_name and base_commit tp task for astroid PR-2496

b509081

danielzayas mentioned this pull request Jan 6, 2026

valid.py runs test suite from upstream repo, not the mirrored and pruned repo #194

Open

danielzayas wants to merge 2 commits into SWE-bench:main from danielzayas:astroid-pr-2496
