
Create a competitive agent with open LLMs #1085

Closed · neubig opened this issue Apr 14, 2024 · 15 comments
Assignees: xingyaoww
Labels: enhancement (New feature or request), severity:medium (Affecting multiple users)
Milestone: May 2024

Comments

@neubig (Contributor) commented Apr 14, 2024

What problem or use case are you trying to solve?

Currently, OpenDevin works to some extent with the strongest closed LLMs such as GPT-4 or Claude Opus, but we have not confirmed good results with open LLMs that can be run locally. We would like to create a recipe for achieving competitive results with local LLMs.

Do you have thoughts on the technical implementation?

This will require a strong (perhaps fine-tuned) coding agent LLM. It will probably have to be tuned starting from strong code LLMs such as CodeLlama, StarCoder, DeepseekCoder, or some other yet-to-be-released LLM.

@neubig neubig added enhancement New feature or request severity:medium Affecting multiple users labels Apr 14, 2024
@neubig neubig added this to the May 2024 milestone Apr 14, 2024
@rezzie-rich commented Apr 14, 2024

The user should be able to choose a single LLM or multiple LLMs to power all the agents. For example, Mixtral could power the generalized agents, DeepseekCoder could power the code-generating agents, and White-Rabbit-Neo could power the testing/cybersecurity agents. This way, only one LLM would be active at a time, matching the active agent, and multiple niche-specific open LLMs could collaborate to outperform private LLMs like GPT-4 while running locally on consumer-grade hardware.
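(For illustration, a minimal sketch of what such per-agent model routing could look like. The role names, model names, and the `call_local_llm` helper are all hypothetical, not OpenDevin's actual configuration API; the endpoint assumes an OpenAI-compatible local server such as ollama or vLLM.)

```python
import requests
from dataclasses import dataclass

@dataclass
class AgentConfig:
    role: str
    model: str  # name of a locally served open LLM

# Hypothetical role-to-model mapping following the comment above.
ROUTING = {
    "generalist": AgentConfig("generalist", "mixtral-8x7b-instruct"),
    "coder": AgentConfig("coder", "deepseek-coder-33b-instruct"),
    "security": AgentConfig("security", "whiterabbitneo-13b"),
}

def call_local_llm(model: str, prompt: str,
                   base_url: str = "http://localhost:11434/v1") -> str:
    """Query a local model through an OpenAI-compatible chat endpoint
    (e.g. one served by ollama or vLLM); the URL is an assumption."""
    resp = requests.post(
        f"{base_url}/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def complete(active_role: str, prompt: str) -> str:
    """Send the prompt to whichever model backs the currently active
    agent, so only one local LLM is in use at a time."""
    return call_local_llm(model=ROUTING[active_role].model, prompt=prompt)
```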

@JayQuimby (Contributor) commented:

I think the models need to be "self-prompting".

From my experience with OpenDevin, it often gets close to doing what I want, but then falls short of the goal and either starts repeating the same command or does something random.

It would be interesting to use two distinct prompting strategies so that the model effectively has a conversation with itself. The first prompt would ask the model to look at its previous actions and the goal and come up with a plan for the next action it could take. The second prompt would ask the agent to perform an action based on the thoughts produced in response to the first prompt.

I think this would offer the agent more flexibility and give it more ability to guide itself toward a better in-context solution than any static prompt template can. The downside is that you need two model queries per action instead of one.
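(A minimal sketch of this plan-then-act loop, assuming any callable prompt-to-text LLM client; the prompt wording is illustrative, not OpenDevin's actual agent loop.)

```python
from typing import Callable

# Illustrative prompt templates for the two stages.
PLAN_PROMPT = (
    "Goal: {goal}\n"
    "Previous actions:\n{history}\n\n"
    "Reflect on the progress so far and describe the single best next step."
)

ACT_PROMPT = (
    "Goal: {goal}\n"
    "Plan for the next step: {plan}\n\n"
    "Output exactly one shell command that carries out this step."
)

def self_prompting_step(llm: Callable[[str], str],
                        goal: str, history: list[str]) -> str:
    """One plan-then-act cycle: the first query produces a free-form plan,
    the second turns that plan into a concrete action. This costs two
    model queries per action, as noted above."""
    plan = llm(PLAN_PROMPT.format(goal=goal,
                                  history="\n".join(history) or "(none)"))
    action = llm(ACT_PROMPT.format(goal=goal, plan=plan))
    history.append(action)  # the planning stage sees this on the next cycle
    return action
```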

Also, Microsoft just released WizardLM 2, and it is way better than anything I have tried locally so far.

@chrisbraddock commented:

gpt-pilot is quite good at this. Try it out to get an idea. I think there are planner and reviewer agents for each step.

I kind of wish OpenDevin incorporated gpt-pilot as the engine.

@xingyaoww xingyaoww self-assigned this Apr 26, 2024
@Jiayi-Pan (Contributor) commented May 6, 2024

A nice way to improve open-source LLMs is by fine-tuning them with trajectories from stronger models like GPT-4. Bonus points if we can filter out the bad ones.

One way to achieve this at scale, similar to WildChat, is to provide officially hosted OpenDevin interfaces that come with a free GPT-4-based backend. In exchange for free use of these agents, users would sign up to allow free distribution of the data and rate the quality of the agents' performance for us.

I imagine this could be used to:

  1. Obtain diverse, high-quality trajectories to fine-tune open agents.
  2. Serve as an easy-to-start demo that attracts more users.
  3. Potentially use the human preference data to create a Chatbot Arena equivalent for coding agents.
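(A minimal sketch of the filtering step mentioned above, assuming a simple success-plus-rating heuristic over collected trajectories; the `Trajectory` fields are illustrative, not an actual data schema.)

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    steps: list[str] = field(default_factory=list)
    resolved: bool = False   # did the agent complete the task?
    user_rating: int = 0     # 1-5 rating collected from the hosted demo

def filter_trajectories(trajs: list[Trajectory],
                        min_rating: int = 4,
                        max_steps: int = 50) -> list[Trajectory]:
    """Keep only runs worth fine-tuning on: drop failed runs, poorly
    rated runs, and suspiciously long ones (often repetition loops)."""
    return [
        t for t in trajs
        if t.resolved
        and t.user_rating >= min_rating
        and len(t.steps) <= max_steps
    ]
```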

@xingyaoww (Collaborator) commented:

Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

@Jiayi-Pan (Contributor) commented May 6, 2024

> Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

Amazing, and thanks for the pointer! I will have a look and see what I can contribute.

@xingyaoww (Collaborator) commented:

@Jiayi-Pan We are currently thinking about re-purposing existing agent-tuning datasets (e.g., code, agent tuning) for (1), so we can have a preliminary v0.1 OSS model :)

@BradKML commented Jun 3, 2024

Also, does this feel like a technical foundation for building fine-tuning toolkits by generating quasi-synthetic data?


github-actions bot commented Sep 2, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Sep 2, 2024
@neubig (Contributor, Author) commented Sep 2, 2024

We're still working on this!

@dorbanianas (Collaborator) commented Sep 2, 2024

Hey @neubig, sorry for the delay. I've been a bit busy these days. I was working on a small version, but I hit some resource limitations and didn't make progress.

@mamoodi mamoodi removed the Stale Inactive for 30 days label Sep 2, 2024

github-actions bot commented Oct 6, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Oct 6, 2024
@enyst enyst removed the Stale Inactive for 30 days label Oct 6, 2024
github-actions bot commented Nov 11, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Nov 11, 2024
@BradKML commented Nov 11, 2024

@Jiayi-Pan here are a few leading questions:

  1. What would the architecture of a competitive coding LLM arena look like? Would models be allowed to run their code multiple times to debug (and without limit for paywalled models)? Which judging criteria should we prioritize (code runtime vs. code generation and debugging time)?
  2. What would the architecture of a fine-tuning dataset generator look like? Should we include every single coding problem alongside codebase debugging problems? Should we include diverse programming languages (including ones that would have memory issues)? Should we mix pure implementations with library use?
  3. (On a meta level) Will the LLM be allowed to self-document programming methodologies (e.g. DS&A, design patterns, ML knowledge) between different mock benchmarks? If so, where would the mock benchmark be sourced from, such that it is distinct from the core dataset used for comparison against other SWE architectures?
  4. (Bonus question) How would Chain-of-Thought and other adjacent architectures be handled? This could be different from just picking LLM architectures where tokens are being predicted, and instead concerns where to turn token outputs back into inputs: https://arxiv.org/html/2401.14295v3

@github-actions github-actions bot removed the Stale Inactive for 30 days label Dec 2, 2024
@neubig (Contributor, Author) commented Dec 28, 2024

I think that with deepseek-v3 and @Jiayi-Pan and @xingyaoww's SWE-Gym project, we now probably have open models that can achieve reasonable scores in OpenHands!

We still need to create a better leaderboard, but we can handle that in a new issue: #5869

Congratulations to us on closing one of the oldest issues in our backlog :)
