
feat: add tool to check if GPUs are available for training #102

Open
wants to merge 17 commits into base: main

Conversation

@vaibs-d (Contributor) commented Apr 20, 2025

This PR adds tools for the MLE agent to assess whether GPUs are available, either via Ray or locally, for use in model training.

@vaibs-d changed the title from "Add gpu support" to "feat: add tool to check if GPUs are available for training" on Apr 20, 2025
@vaibs-d marked this pull request as ready for review on April 21, 2025 03:35

@cubic-dev-ai (bot) left a comment

mrge found 5 issues across 6 files. View them in mrge.io

@marcellodebernardi (Contributor) left a comment

Nice feature @vaibs-d! I have a couple of questions about how the new tools plug into the agentic workflow; see the inline comments. Otherwise, excited to get this feature merged 🚀

-def get_executor_tool(distributed: bool = False) -> Callable:
-    """Get the appropriate executor tool based on the distributed flag."""
+def get_executor_tool() -> Callable:
+    """Get the executor tool for training code execution."""

We actually don't need this to be wrapped in a closure anymore, since we're not passing any parameters to it.
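
For illustration, the simplification could look roughly like the sketch below. The body and the execute_training_code name are hypothetical (the real implementation isn't visible in this hunk), and the @tool decorator is assumed to be LangChain-style:

from typing import Callable
from langchain_core.tools import tool  # assumption: LangChain-style @tool

# Before (sketch): a factory returning the tool via a closure
def get_executor_tool() -> Callable:
    @tool
    def execute_training_code(code: str) -> str:
        """Execute generated training code and return its output."""
        ...  # hypothetical body
    return execute_training_code

# After (sketch): define the tool at module level, no factory needed
@tool
def execute_training_code(code: str) -> str:
    """Execute generated training code and return its output."""
    ...  # hypothetical body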

@tool
def get_gpu_info() -> dict:
    """
    Get available GPU information for code generation.

I have two questions about this tool:

  1. It doesn't seem to be added to the list of tools for any of the agents, unless I'm missing something?
  2. I'm not sure how the agent is expected to use this tool. Is the idea that the agent would use it to check whether different ML frameworks have access to a GPU?

I think I see a couple of issues here:

  1. The docstring may be a little too terse for an agent to reliably understand what this is for (see the sketch after this list).
  2. This should be added to an agent's tools, and the agent's prompts should probably also be modified to suggest using it.
  3. This will check for GPU availability on the local compute instance; if we're using a Ray cluster, this would not check for the presence of GPUs in the Ray cluster.
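
On points 1 and 3, a rough sketch of what a more explicit, agent-facing docstring could look like is below. The return keys, the use of PyTorch for the local check, and the LangChain-style @tool import are assumptions for illustration, not the PR's actual implementation:

from langchain_core.tools import tool  # assumption: LangChain-style @tool

@tool
def get_gpu_info() -> dict:
    """Report GPUs available on the LOCAL machine only.

    Call this before generating training code that will run on the local
    compute instance, so the code can target CUDA devices or fall back to
    CPU. This does NOT inspect a Ray cluster; use get_ray_info for that.

    Returns a dict with:
      - "cuda_available": bool
      - "device_count": int
      - "devices": list of GPU device names
    """
    import torch  # assumption: PyTorch is the GPU-detection backend

    if not torch.cuda.is_available():
        return {"cuda_available": False, "device_count": 0, "devices": []}
    count = torch.cuda.device_count()
    return {
        "cuda_available": True,
        "device_count": count,
        "devices": [torch.cuda.get_device_name(i) for i in range(count)],
    }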

@tool
def get_ray_info() -> dict:
    """
    Get Ray cluster information including GPU availability.

As with the tool above, I'm not entirely clear on how an agent should use this tool. The docstring probably doesn't give an agent enough context about what this tool is really "for" and how it fits into the broader workflow.

At least with GPT-4o and Claude 3.7, I've seen that tool docstrings need to be very unambiguous for the agent to use the tool reliably and correctly.
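
As a concrete example, the docstring could spell out when the agent should prefer this tool over get_gpu_info and what the returned fields mean. A rough sketch, assuming Ray is installed and a LangChain-style @tool decorator; the key names are placeholders rather than the PR's actual return format:

import ray
from langchain_core.tools import tool  # assumption: LangChain-style @tool

@tool
def get_ray_info() -> dict:
    """Report GPU availability in the CONNECTED Ray cluster.

    Call this (instead of get_gpu_info) when training will run on a Ray
    cluster, so generated code can request the right number of GPUs.

    Returns a dict with:
      - "ray_available": bool
      - "total_gpus": float, GPUs registered across the cluster
      - "available_gpus": float, GPUs currently unclaimed
    """
    if not ray.is_initialized():
        return {"ray_available": False, "total_gpus": 0.0, "available_gpus": 0.0}
    return {
        "ray_available": True,
        "total_gpus": ray.cluster_resources().get("GPU", 0.0),
        "available_gpus": ray.available_resources().get("GPU", 0.0),
    }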
