
Create a competitive agent with open LLMs #1085

Closed · neubig opened this issue Apr 14, 2024 · 15 comments
Assignees: xingyaoww
Labels: enhancement (New feature or request), severity:medium (Affecting multiple users)
Milestone: May 2024

Comments

@neubig (Contributor) commented Apr 14, 2024

What problem or use case are you trying to solve?

Currently, OpenDevin works to some extent with the strongest closed LLMs such as GPT-4 or Claude Opus, but we have not confirmed good results with open LLMs that can be run locally. We would like to create a recipe for achieving competitive results with local LLMs.

Do you have thoughts on the technical implementation?

This will require a strong (perhaps fine-tuned) coding agent LLM. It will probably have to be tuned starting from strong code LLMs such as CodeLlama, StarCoder, DeepseekCoder, or some other yet-to-be-released LLM.

@neubig neubig added enhancement New feature or request severity:medium Affecting multiple users labels Apr 14, 2024
@neubig neubig added this to the May 2024 milestone Apr 14, 2024
@rezzie-rich commented Apr 14, 2024

The user should be able to choose a single LLM or multiple LLMs to power all the agents. For example, Mixtral could power the generalized agents, DeepseekCoder could power the code-generating agents, and White-Rabbit-Neo could power the testing/cybersecurity agents. This way, only one LLM would be active at a time, matching the active agent, and multiple niche-specific open LLMs could collaborate to outperform private LLMs like GPT-4 while running locally on consumer-grade hardware.
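(For illustration, a minimal sketch of what such per-agent model routing could look like. The role names, model names, and the `call_local_llm` helper are all hypothetical, not OpenDevin's actual configuration API; the endpoint assumes an OpenAI-compatible local server such as ollama or vLLM.)

```python
import requests
from dataclasses import dataclass

@dataclass
class AgentConfig:
    role: str
    model: str  # name of a locally served open LLM

# Hypothetical role-to-model mapping following the comment above.
ROUTING = {
    "generalist": AgentConfig("generalist", "mixtral-8x7b-instruct"),
    "coder": AgentConfig("coder", "deepseek-coder-33b-instruct"),
    "security": AgentConfig("security", "whiterabbitneo-13b"),
}

def call_local_llm(model: str, prompt: str,
                   base_url: str = "http://localhost:11434/v1") -> str:
    """Query a local model through an OpenAI-compatible chat endpoint
    (e.g. one served by ollama or vLLM); the URL is an assumption."""
    resp = requests.post(
        f"{base_url}/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def complete(active_role: str, prompt: str) -> str:
    """Send the prompt to whichever model backs the currently active
    agent, so only one local LLM is in use at a time."""
    return call_local_llm(model=ROUTING[active_role].model, prompt=prompt)
```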

@JayQuimby (Contributor) commented:

I think the models need to be "self-prompting".

From my experience with OpenDevin, it often gets close to doing what I want, but then falls short of the goal and either starts repeating the same command or does something random.

It would be interesting to use two distinct prompting strategies so that the model effectively has a conversation with itself. The first prompt would ask the model to look at its previous actions and the goal and come up with a plan for the next action it could take. The second prompt would ask the agent to perform an action based on the thoughts produced in response to the first prompt.

I think this would offer the agent more flexibility and give it more ability to guide itself toward a better in-context solution than any static prompt template can. The downside is that you need two model queries per action instead of one.
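(A minimal sketch of this plan-then-act loop, assuming any callable prompt-to-text LLM client; the prompt wording is illustrative, not OpenDevin's actual agent loop.)

```python
from typing import Callable

# Illustrative prompt templates for the two stages.
PLAN_PROMPT = (
    "Goal: {goal}\n"
    "Previous actions:\n{history}\n\n"
    "Reflect on the progress so far and describe the single best next step."
)

ACT_PROMPT = (
    "Goal: {goal}\n"
    "Plan for the next step: {plan}\n\n"
    "Output exactly one shell command that carries out this step."
)

def self_prompting_step(llm: Callable[[str], str],
                        goal: str, history: list[str]) -> str:
    """One plan-then-act cycle: the first query produces a free-form plan,
    the second turns that plan into a concrete action. This costs two
    model queries per action, as noted above."""
    plan = llm(PLAN_PROMPT.format(goal=goal,
                                  history="\n".join(history) or "(none)"))
    action = llm(ACT_PROMPT.format(goal=goal, plan=plan))
    history.append(action)  # the planning stage sees this on the next cycle
    return action
```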

Also, Microsoft just released WizardLM 2, and it is way better than anything I have tried locally so far.

@chrisbraddock commented:

gpt-pilot is quite good at this. Try it out to get an idea. I think there are planner and reviewer agents for each step.

I kind of wish OpenDevin incorporated gpt-pilot as the engine.

@xingyaoww xingyaoww self-assigned this Apr 26, 2024
@Jiayi-Pan (Contributor) commented May 6, 2024

A nice way to improve open-source LLMs is by fine-tuning them with trajectories from stronger models like GPT-4. Bonus points if we can filter out the bad ones.

One way to achieve this at scale, similar to WildChat, is to provide officially hosted OpenDevin interfaces that come with a free GPT-4-based backend. In exchange for free use of these agents, users would sign up to allow free distribution of the data and rate the quality of the agents' performance for us.

I imagine this could be used to:

  1. Obtain diverse, high-quality trajectories to fine-tune open agents.
  2. Serve as an easy-to-start demo that attracts more users.
  3. Potentially use the human preference data to create a Chatbot Arena equivalent for coding agents.
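(A minimal sketch of the filtering step mentioned above, assuming a simple success-plus-rating heuristic over collected trajectories; the `Trajectory` fields are illustrative, not an actual data schema.)

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    steps: list[str] = field(default_factory=list)
    resolved: bool = False   # did the agent complete the task?
    user_rating: int = 0     # 1-5 rating collected from the hosted demo

def filter_trajectories(trajs: list[Trajectory],
                        min_rating: int = 4,
                        max_steps: int = 50) -> list[Trajectory]:
    """Keep only runs worth fine-tuning on: drop failed runs, poorly
    rated runs, and suspiciously long ones (often repetition loops)."""
    return [
        t for t in trajs
        if t.resolved
        and t.user_rating >= min_rating
        and len(t.steps) <= max_steps
    ]
```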

@xingyaoww (Collaborator) commented:

Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

@Jiayi-Pan (Contributor) commented May 6, 2024

> Thanks @Jiayi-Pan!! All of the bullet points mentioned are actually on our roadmap :))

Amazing, and thanks for the pointer! I will have a look and see what I can contribute.

@xingyaoww (Collaborator) commented:

@Jiayi-Pan We are currently thinking about re-purposing existing agent-tuning datasets (e.g., code, agent tuning) for (1), so we can have a preliminary v0.1 OSS model :)

@BradKML commented Jun 3, 2024

Also, does this feel like a technical foundation for building fine-tuning toolkits by generating quasi-synthetic data?


github-actions bot commented Sep 2, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Sep 2, 2024
@neubig (Contributor, Author) commented Sep 2, 2024

We're still working on this!

@dorbanianas (Collaborator) commented Sep 2, 2024

Hey @neubig, sorry for the delay. I've been a bit busy these days. I was working on a small version, but I hit some resource limitations and didn't make progress.

@mamoodi mamoodi removed the Stale Inactive for 30 days label Sep 2, 2024

github-actions bot commented Oct 6, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Oct 6, 2024
@enyst enyst removed the Stale Inactive for 30 days label Oct 6, 2024
github-actions bot commented Nov 11, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Nov 11, 2024
@BradKML commented Nov 11, 2024

@Jiayi-Pan here are a few leading questions:

  1. What would the architecture of a competitive coding LLM arena look like? Would models be allowed to run their code multiple times to debug (and without limit for paywalled models)? Which judging criteria should we prioritize (code runtime vs. code generation and debugging time)?
  2. What would the architecture of a fine-tuning dataset generator look like? Should we include every single coding problem alongside codebase debugging problems? Should we include diverse programming languages (including ones that would have memory issues)? Should we mix pure implementations with library use?
  3. (On a meta level) Will the LLM be allowed to self-document programming methodologies (e.g. DS&A, design patterns, ML knowledge) between different mock benchmarks? If so, where would the mock benchmark be sourced from, such that it is distinct from the core dataset used for comparison against other SWE architectures?
  4. (Bonus question) How would Chain-of-Thought and other adjacent architectures be handled? This could be different from just picking LLM architectures where tokens are being predicted, and instead concerns where to turn token outputs back into inputs: https://arxiv.org/html/2401.14295v3

@github-actions github-actions bot removed the Stale Inactive for 30 days label Dec 2, 2024
@neubig (Contributor, Author) commented Dec 28, 2024

I think that with deepseek-v3 and @Jiayi-Pan and @xingyaoww's SWE-Gym project, we now probably have open models that can achieve reasonable scores in OpenHands!

We still need to create a better leaderboard, but we can handle that in a new issue: #5869

Congratulations to us on closing one of the oldest issues in our backlog :)
