
Inconsistent Code Output Format Requirements in Prompts #53

Open
Haskely opened this issue Mar 3, 2025 · 1 comment

Haskely commented Mar 3, 2025

I've identified some contradictory requirements in the prompt format specifications across different components of the SWELancer benchmark:

First, I extracted some examples from the prompt column of swelancer_tasks.csv and found that each contains either two messages (when variant is swe_manager) or three messages (when variant is ic_swe).

  1. Initial Message Format Requirements
    In the first message from swelancer_tasks.csv, the prompt explicitly requires:
...

Output Format:
Output only the final Python script in plaintext wrapped in <code></code> tags. Do not include any extraneous text, explanations, or formatting outside of these tags.

Package Management:
If your script depends on packages that are not preinstalled, include the appropriate installation commands at the beginning of your script using !pip install package1 package2 ... or !apt-get install package as needed. Assume that git is not installed unless you explicitly install it.

Debugging and Iteration:
While you may include debugging print statements or verbose outputs during development, ensure that your final script (the one enclosed in <code></code> tags) is the complete, standalone solution.

...
  2. Final Message Format Requirements
    However, the last message contradicts this by specifying:
You must send Python commands using ```python <code here>``` syntax; this is the ONLY valid way to send commands.
  3. Hardcoded Message in Evaluation Code
    Furthermore, there's a hardcoded message in the evaluation code
````python
for prompt_message in task.prompt:
    messages.append({
        "role": "user",
        "content": str(prompt_message["content"])  # type: ignore
    })
messages.append({"role": "user", "content": """The repository is cloned in your CWD. You must send Python code in backticks in each response to me, and I will execute the code and send you back the result, for example:
```python
import os
print(os.listdir())
```
You can include other content in your message, as long as it has the Python code block. You must only submit one code block per message.
The python code block is your means of interacting with the environment; write Python code to view files, edit files, and run commands. Do not leave any steps incomplete. You must implement all fixes yourself.
When you are done, send "DONE" (and nothing else - you don't need a python code block), and we will grade your submission.
Please note that the Python code is not a Jupyter notebook; you must write a full Python script each time, including imports, and printing things if you want to see the output.
"""})
max_turns = 30
print(messages)
````
    that adds additional, different requirements:

```
messages.append({"role": "user", "content": """
...
You must send Python code in backticks in each response...
Please note that the Python code is not a Jupyter notebook...
"""})
```
  4. Code Extraction Implementation
    The actual code extraction in the evaluation only looks for code blocks in the triple-backtick ```python format:

````python
python_blocks = re.findall(r"```python\s*\n(.*?)\n```", model_response, re.DOTALL)
````
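The mismatch is easy to verify: running this extraction regex against a response that follows the first message's `<code></code>` instructions matches nothing, while the fenced format is matched. (A minimal sketch; the two sample responses are invented for illustration.)

````python
import re

# The extraction pattern quoted above from the evaluation code.
pattern = r"```python\s*\n(.*?)\n```"

# Hypothetical model response following the FIRST message's instructions:
tag_response = "<code>\nprint('hello')\n</code>"

# Hypothetical model response following the FINAL message's instructions:
fence_response = "```python\nprint('hello')\n```"

print(re.findall(pattern, tag_response, re.DOTALL))    # [] -- the tag format is never matched
print(re.findall(pattern, fence_response, re.DOTALL))  # ["print('hello')"]
````

So a model that dutifully follows the first message's output format would have all of its code silently discarded by the evaluation.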

These inconsistencies create confusion:

  1. The first message requires <code></code> tags and permits Jupyter-style commands (e.g. !pip install)
  2. The final message mandates triple-backtick ```python blocks
  3. The hardcoded evaluation message reinforces the triple-backtick format and explicitly states the environment is not a Jupyter notebook

Recommendation:
Consider standardizing the format requirements across all prompts and ensuring they align with the actual code extraction implementation. This will provide clearer instructions to the models and prevent potential evaluation issues.
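As an interim mitigation, the extraction could tolerate both formats until the prompts are unified. The sketch below is hypothetical (extract_code is an invented helper, not code from the repository); the cleaner fix remains standardizing the prompts themselves.

````python
import re
from typing import Optional

def extract_code(model_response: str) -> Optional[str]:
    """Hypothetical helper: accept either of the two formats the prompts mention."""
    # Prefer the ```python fenced format the evaluation already looks for.
    fenced = re.findall(r"```python\s*\n(.*?)\n```", model_response, re.DOTALL)
    if fenced:
        return fenced[0]
    # Fall back to the <code></code> tags required by the first message.
    tagged = re.findall(r"<code>\s*(.*?)\s*</code>", model_response, re.DOTALL)
    if tagged:
        return tagged[0]
    return None
````

Preferring the fenced format first keeps behavior unchanged for responses that already comply with the final message, while no longer zeroing out responses that followed the first message's instructions.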

Haskely (Author) commented Mar 3, 2025

[Image: example prompt from the paper's appendix]

The appendix of your paper shows examples indicating that you initially provided a Jupyter environment. However, for reasons such as the stated goal that to "run SWELancer at scale in your own environment, you'll need to implement your own compute infrastructure," you likely removed the Jupyter dependency to simplify user onboarding and adopted a plain python -c "code" style of execution instead. Unfortunately, it appears the prompt data was not fully updated during this migration, which would explain the inconsistencies above.
