I've identified some contradictory requirements in the prompt format specifications across different components of the SWELancer benchmark:
First, I extracted some examples from the prompt column in swelancer_tasks.csv and found that each contains either two messages (when the variant is swe_manager) or three messages (when the variant is ic_swe).
Initial Message Format Requirements
In the first message from swelancer_tasks.csv, the prompt explicitly requires:
...
Output Format:
Output only the final Python script in plaintext wrapped in <code></code> tags. Do not include any extraneous text, explanations, or formatting outside of these tags.
Package Management:
If your script depends on packages that are not preinstalled, include the appropriate installation commands at the beginning of your script using !pip install package1 package2 ... or !apt-get install package as needed. Assume that git is not installed unless you explicitly install it.
Debugging and Iteration:
While you may include debugging print statements or verbose outputs during development, ensure that your final script (the one enclosed in <code></code> tags) is the complete, standalone solution.
...
Final Message Format Requirements
However, the last message contradicts this by specifying:
You must send Python commands using ```python <code here>``` syntax; this is the ONLY valid way to send commands.
Hardcoded Message in Evaluation Code
Furthermore, there's a hardcoded message in the evaluation code
messages.append({"role": "user", "content": """The repository is cloned in your CWD. You must send Python code in backticks in each response to me, and I will execute the code and send you back the result, for example:
```python
import os
print(os.listdir())
```
You can include other content in your message, as long as it has the Python code block. You must only submit one code block per message.
The python code block is your means of interacting with the environment; write Python code to view files, edit files, and run commands. Do not leave any steps incomplete. You must implement all fixes yourself.
When you are done, send "DONE" (and nothing else - you don't need a python code block), and we will grade your submission.
Please note that the Python code is not a Jupyter notebook; you must write a full Python script each time, including imports, and printing things if you want to see the output.
"""})
max_turns=30
print(messages)
This message adds additional, different requirements:
messages.append({"role": "user", "content": """ ... You must send Python code in backticks in each response... Please note that the Python code is not a Jupyter notebook... """})
Code Extraction Implementation
The actual code extraction in the evaluation only looks for code blocks in the ```python format.
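To make the mismatch concrete, here is a minimal sketch of an extractor that only recognizes ```python fences. The regex and the `extract_python_block` helper are assumptions for illustration, not the code at the referenced line of swelancer_agent.py. A response that follows the first message's <code></code> instruction yields nothing.

```python
import re
from typing import Optional

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Hypothetical extractor (an assumption, not the benchmark's code):
# it matches only fenced blocks that are tagged as python.
PYTHON_FENCE = re.compile(
    re.escape(FENCE) + r"python\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_python_block(response: str) -> Optional[str]:
    """Return the body of the first python-fenced block, or None if absent."""
    match = PYTHON_FENCE.search(response)
    return match.group(1).strip() if match else None


# A response that follows the final message's format.
fenced = f"Listing files:\n{FENCE}python\nimport os\nprint(os.listdir())\n{FENCE}"
# A response that follows the first message's <code></code> format.
tagged = "<code>\nimport os\nprint(os.listdir())\n</code>"

print(extract_python_block(fenced))  # the two-line script
print(extract_python_block(tagged))  # None: nothing would be executed
```

Under this assumption, a model that obeys the first message literally would have none of its code run.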
These inconsistencies create confusion:
The first message requires <code></code> tags and permits Jupyter-style commands (!pip install, !apt-get install)
The final message mandates the ```python format
The hardcoded evaluation message reinforces the ```python format and explicitly states it's not a Jupyter environment
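The Jupyter point matters in practice: a line starting with `!` is an IPython shell escape, not Python syntax. The sketch below uses only the built-in `compile` to show that a script written per the first message would fail before running in a plain interpreter. The `is_valid_python` helper and the sample scripts are illustrative assumptions; nothing is installed or executed.

```python
def is_valid_python(source: str) -> bool:
    """Return True if `source` compiles as a plain Python script (it is not run)."""
    try:
        compile(source, "<script>", "exec")
        return True
    except SyntaxError:
        return False


# Script written the way the first message permits (Jupyter-style install line).
jupyter_style = "!pip install requests\nimport requests\n"

# One plain-Python alternative that installs via a subprocess instead.
plain_style = (
    "import subprocess, sys\n"
    "subprocess.run([sys.executable, '-m', 'pip', 'install', 'requests'], check=True)\n"
)

print(is_valid_python(jupyter_style))  # False
print(is_valid_python(plain_style))    # True
```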
Recommendation:
Consider standardizing the format requirements across all prompts and ensuring they align with the actual code extraction implementation. This will provide clearer instructions to the models and prevent potential evaluation issues.
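One way to keep the prompts aligned with the extractor would be a small check run over each task's messages. This is a sketch under stated assumptions: the marker list, the `find_conflicts` helper, and the abbreviated sample messages are hypothetical, and only the role/content message shape is taken from the evaluation code quoted above.

```python
# Substrings that conflict with a python-fence-only, non-Jupyter setup.
# The list is an assumption based on the instructions quoted in this issue.
CONFLICTING_MARKERS = {
    "<code></code>": "asks for <code></code> tags",
    "!pip install": "permits Jupyter-style !pip",
    "!apt-get install": "permits Jupyter-style !apt-get",
}


def find_conflicts(messages):
    """Return (message_index, reason) pairs for instructions that conflict."""
    conflicts = []
    for index, message in enumerate(messages):
        for marker, reason in CONFLICTING_MARKERS.items():
            if marker in message["content"]:
                conflicts.append((index, reason))
    return conflicts


# Abbreviated stand-ins for the first prompt message and the hardcoded message.
messages = [
    {
        "role": "user",
        "content": (
            "Output only the final Python script in plaintext wrapped in "
            "<code></code> tags. ... using !pip install package1 package2 ... "
            "or !apt-get install package as needed."
        ),
    },
    {
        "role": "user",
        "content": "You must send Python code in backticks in each response to me.",
    },
]

for index, reason in find_conflicts(messages):
    print(f"message {index}: {reason}")
```

With these sample messages, all three findings point at message 0, and the hardcoded message passes.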
The appendix in your paper shows examples indicating that you initially provided a Jupyter environment. However, given the note that to "run SWELancer at scale in your own environment, you'll need to implement your own compute infrastructure," you likely removed the Jupyter dependency to simplify user onboarding and adopted a pure bash python -c "code" approach instead. Unfortunately, issues seem to have arisen during the data migration process.
Code referenced in the issue above: SWELancer-Benchmark/swelancer_agent.py, lines 105 to 126 in e69ab27 (the hardcoded message) and line 150 in e69ab27 (the code extraction).