Member
Thanks so much @danielzayas! Just letting you know I saw this - will come back to this by the EOW.
Contributor
Author
I created this PR mainly to learn about the repo and workflow described in CONTRIBUTING.md ➡️ totally fine to close this without merging to main. I don't expect that particular task instance to be world-changing lol. The primary positive outcomes of this seem to be
TLDR: we can merge if we want the new task and just close if not.
And thanks again for the awesome open source project!!
Background
The SWE-bench/SWE-smith dataset on HuggingFace already contains 81 tasks from pylint-dev/astroid built via the PR Mirroring method. I wanted to try creating a new task for an existing repo, so I picked a PR that did not already have a task and followed the CONTRIBUTING.md instructions ➡️ proposed a new task.
RE "upload the zipped file as a new PR", I can't create a PR without a diff versus main, so I force-added the unzipped SWE-smith/logs/ files to the commit + uploaded the corresponding ZIP files (linked below). Given the .gitignore, we don't actually want to merge the SWE-smith/logs/ changes to main.
Proposed Change
Create a new task for astroid PR-2496 (pylint-dev/astroid#2496) via the PR Mirroring method as per SWE-smith/docs/guides/create_instances.md.
https://github.com/SWE-bench/SWE-smith/blob/main/CONTRIBUTING.md#create-task-instances ➡️ "Zip this folder" ➡️ "bug_gen.zip" file: bug_gen.zip
Test Plan
Ran swesmith/harness/valid.py ➡️ logs/run_validation/pylint-dev__astroid.b114f6b5/ output ➡️ "run_validation.zip" file output: run_validation.zip
Ran swesmith/harness/eval.py ➡️ logs/run_evaluation/eval_astroid_2496/ output ➡️ "run_evaluation.zip" file output: run_validation.zip
Ran swesmith/harness/gather.py ➡️ logs/task_insts/ ➡️ "task_insts.zip" file output: task_insts.zip
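The zipping step used for each of these outputs can be scripted. The following is a minimal sketch using only the Python standard library; zip_log_dir is a hypothetical helper (not part of SWE-smith), and which directory level gets archived for each ZIP is an assumption.

```python
import shutil
from pathlib import Path


def zip_log_dir(log_dir: str, archive_stem: str, out_dir: str = ".") -> Path:
    """Zip log_dir into <out_dir>/<archive_stem>.zip and return the archive path.

    Hypothetical helper for illustration; not part of SWE-smith.
    """
    src = Path(log_dir).resolve()
    if not src.is_dir():
        raise FileNotFoundError(f"log directory not found: {src}")
    # make_archive appends ".zip" itself, so base_name carries no extension.
    base_name = Path(out_dir).resolve() / archive_stem
    archive = shutil.make_archive(
        str(base_name),
        "zip",
        root_dir=str(src.parent),  # archive paths are relative to the parent...
        base_dir=src.name,         # ...so the top-level folder name is preserved
    )
    return Path(archive)
```

For example, zip_log_dir("logs/run_validation", "run_validation") would write run_validation.zip to the current directory, with run_validation/ as the top-level folder inside the archive.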
Things I Learned (TIL)
(1) generate.py does not apply diff if > 1000 lines
SWE-smith/swesmith/bug_gen/mirror/generate.py couldn't apply the original diff directly because one file is several thousand lines long. The mirroring flow only attempts a direct git apply if every edited file satisfies the heuristics inside should_attempt_recovery, including "No changed file is >1000 lines". Falling back to the "recovery" path gave a faulty result because the LLM-produced revert missed logic ➡️ the fix was to raise that limit to 10 000 lines and re-run the pipeline. With the higher limit, the generator was allowed to reuse the actual PR diff, and the mirrored bug patch became bit-for-bit identical to the upstream fix ➡️ I contributed the change back to SWE-smith upstream via #178.
(2) valid.py runs the test suite from the pylint-dev/astroid upstream, not the mirrored and pruned repo at danielzayas/astroid
Validation says, "The validation harness works in two steps. First, it 1) runs the original repository's test suite to get the passing statuses of the existing tests. Then, it 2) applies each candidate task instance to the repository and runs the test suite again". It seems like step 1 runs the upstream test suite, whereas step 2 runs the test suite after mirroring and pruning ➡️ some test files might not be present in the latter, post-pruning ➡️ the eval.py run was checking three PASS-to-PASS tests that never execute inside the container (tests/brain/test_nose.py::NoseBrainTest::test_nose_tools, tests/test_modutils.py::BackportStdlibNamesTest::test_import_error, tests/test_nodes.py::BoundMethodNodeTest::test_is_property). They appear in the valid.py logs from step 1 but aren't present in the mirror snapshot used for step 2, so the grading script always marks them as missing. I removed those three entries from the PASS_TO_PASS list in logs/task_insts/pylint-dev__astroid.b114f6b5.json. If they are kept in the mirror rather than pruned, they can be added back to the P2P test coverage.
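Both workarounds can be sketched in a few lines of Python. This is only an illustration, not SWE-smith's code: all_files_within_limit and prune_pass_to_pass are hypothetical helpers, the line-count check stands in for just one of the heuristics mentioned in (1), and the instance layout (a dict holding a PASS_TO_PASS list) is assumed from the JSON key named in (2).

```python
from typing import Iterable, Mapping


def all_files_within_limit(line_counts: Mapping[str, int], max_lines: int = 1000) -> bool:
    """Hypothetical stand-in for the "No changed file is >1000 lines" heuristic.

    line_counts maps each changed file to its total number of lines.
    """
    return all(count <= max_lines for count in line_counts.values())


def prune_pass_to_pass(instance: dict, executed_tests: Iterable[str]) -> dict:
    """Return a copy of a task instance whose PASS_TO_PASS list keeps only
    tests that actually ran in the container (hypothetical helper)."""
    executed = set(executed_tests)
    pruned = dict(instance)  # shallow copy; the input dict is left unchanged
    pruned["PASS_TO_PASS"] = [
        test for test in instance.get("PASS_TO_PASS", []) if test in executed
    ]
    return pruned


if __name__ == "__main__":
    # Made-up line counts: one changed file is several thousand lines long.
    counts = {"big_module.py": 5000, "small_module.py": 120}
    print(all_files_within_limit(counts))                    # False
    print(all_files_within_limit(counts, max_lines=10_000))  # True

    instance = {
        "PASS_TO_PASS": [
            "tests/test_nodes.py::BoundMethodNodeTest::test_is_property",
            "tests/test_example.py::test_kept",  # made-up test name
        ]
    }
    # Only the made-up test ran, so the missing one is dropped.
    print(prune_pass_to_pass(instance, ["tests/test_example.py::test_kept"]))
```

Filtering against the set of tests that actually executed, rather than hard-coding the three names, would also catch any other tests that pruning removes from the mirror.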