
Inconsistent Code Output Format Requirements in Prompts #53

Open
Haskely opened this issue Mar 3, 2025 · 1 comment

Haskely commented Mar 3, 2025

I've identified some contradictory requirements in the prompt format specifications across different components of the SWELancer benchmark:

First, I extracted some examples from the prompt column of swelancer_tasks.csv and found that each contains either two messages (when variant is swe_manager) or three messages (when variant is ic_swe).

  1. Initial Message Format Requirements
    In the first message from swelancer_tasks.csv, the prompt explicitly requires:
...

Output Format:
Output only the final Python script in plaintext wrapped in <code></code> tags. Do not include any extraneous text, explanations, or formatting outside of these tags.

Package Management:
If your script depends on packages that are not preinstalled, include the appropriate installation commands at the beginning of your script using !pip install package1 package2 ... or !apt-get install package as needed. Assume that git is not installed unless you explicitly install it.

Debugging and Iteration:
While you may include debugging print statements or verbose outputs during development, ensure that your final script (the one enclosed in <code></code> tags) is the complete, standalone solution.

...
  2. Final Message Format Requirements
    However, the last message contradicts this by specifying:
You must send Python commands using ```python <code here>``` syntax; this is the ONLY valid way to send commands.
  3. Hardcoded Message in Evaluation Code
    Furthermore, there's a hardcoded message in the evaluation code
````python
for prompt_message in task.prompt:
    messages.append({
        "role": "user",
        "content": str(prompt_message["content"])  # type: ignore
    })
messages.append({"role": "user", "content": """The repository is cloned in your CWD. You must send Python code in backticks in each response to me, and I will execute the code and send you back the result, for example:
```python
import os
print(os.listdir())
```
You can include other content in your message, as long as it has the Python code block. You must only submit one code block per message.
The python code block is your means of interacting with the environment; write Python code to view files, edit files, and run commands. Do not leave any steps incomplete. You must implement all fixes yourself.
When you are done, send "DONE" (and nothing else - you don't need a python code block), and we will grade your submission.
Please note that the Python code is not a Jupyter notebook; you must write a full Python script each time, including imports, and printing things if you want to see the output.
"""})
max_turns = 30
print(messages)
````
    that adds additional, different requirements:

```
messages.append({"role": "user", "content": """
...
You must send Python code in backticks in each response...
Please note that the Python code is not a Jupyter notebook...
"""})
```
  4. Code Extraction Implementation
    The actual code extraction in the evaluation only looks for code blocks in the triple-backtick ```python format:

````python
python_blocks = re.findall(r"```python\s*\n(.*?)\n```", model_response, re.DOTALL)
````
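The mismatch is easy to verify: running this extraction regex against a response that follows the first message's `<code></code>` instructions matches nothing, while the fenced format is matched. (A minimal sketch; the two sample responses are invented for illustration.)

````python
import re

# The extraction pattern quoted above from the evaluation code.
pattern = r"```python\s*\n(.*?)\n```"

# Hypothetical model response following the FIRST message's instructions:
tag_response = "<code>\nprint('hello')\n</code>"

# Hypothetical model response following the FINAL message's instructions:
fence_response = "```python\nprint('hello')\n```"

print(re.findall(pattern, tag_response, re.DOTALL))    # [] -- the tag format is never matched
print(re.findall(pattern, fence_response, re.DOTALL))  # ["print('hello')"]
````

So a model that dutifully follows the first message's output format would have all of its code silently discarded by the evaluation.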

These inconsistencies create confusion:

  1. The first message requires <code></code> tags and permits Jupyter-style commands (e.g. !pip install)
  2. The final message mandates triple-backtick ```python blocks
  3. The hardcoded evaluation message reinforces the triple-backtick format and explicitly states the environment is not a Jupyter notebook

Recommendation:
Consider standardizing the format requirements across all prompts and ensuring they align with the actual code extraction implementation. This will provide clearer instructions to the models and prevent potential evaluation issues.
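As an interim mitigation, the extraction could tolerate both formats until the prompts are unified. The sketch below is hypothetical (extract_code is an invented helper, not code from the repository); the cleaner fix remains standardizing the prompts themselves.

````python
import re
from typing import Optional

def extract_code(model_response: str) -> Optional[str]:
    """Hypothetical helper: accept either of the two formats the prompts mention."""
    # Prefer the ```python fenced format the evaluation already looks for.
    fenced = re.findall(r"```python\s*\n(.*?)\n```", model_response, re.DOTALL)
    if fenced:
        return fenced[0]
    # Fall back to the <code></code> tags required by the first message.
    tagged = re.findall(r"<code>\s*(.*?)\s*</code>", model_response, re.DOTALL)
    if tagged:
        return tagged[0]
    return None
````

Preferring the fenced format first keeps behavior unchanged for responses that already comply with the final message, while no longer zeroing out responses that followed the first message's instructions.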

Haskely (Author) commented Mar 3, 2025

[Image: example prompt from the paper's appendix]

The appendix of your paper shows examples indicating that you initially provided a Jupyter environment. However, for reasons such as the stated goal that to "run SWELancer at scale in your own environment, you'll need to implement your own compute infrastructure," you likely removed the Jupyter dependency to simplify user onboarding and adopted a plain python -c "code" style of execution instead. Unfortunately, it appears the prompt data was not fully updated during this migration, which would explain the inconsistencies above.
