[Feature Request] Cloud GPU & Remote Experiment Execution Backend #7

@koen666

Problem

The SciForge validation loop is hard-coded to run experiments locally via subprocess.run(["python3.12", train_script], ...) on the same machine as the Flask server. This means the system cannot execute real deep-learning experiments that require GPUs, nor can it scale beyond a single local CPU.

Evidence from Codebase

  • Local-only execution: agents/validation_loop.py:88-95 directly invokes python3.12 train.py in a local subprocess with no abstraction for remote backends.
  • Explicit CPU fallback: agents/experiment_forge.py:78 states the bootstrap train.py must be "self-contained (only uses stdlib + numpy + scipy, no GPU required)".
  • Zero cloud/remote infrastructure: No references to SSH, Docker, Kubernetes, Slurm, AWS, GCP, Azure, or any container orchestration in agents/ or orchestrator/.
  • No GPU detection: No calls to nvidia-smi, no CUDA availability checks, no GPU allocation logic anywhere.
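A minimal sketch of the kind of GPU check that is currently absent, assuming the NVIDIA driver tools are on PATH when a GPU is present. The function names `detect_gpus` and `parse_gpu_names` are hypothetical and do not exist in the codebase.

```python
import shutil
import subprocess


def parse_gpu_names(output: str) -> list[str]:
    """Split nvidia-smi CSV output into one GPU name per non-empty line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def detect_gpus(timeout_s: float = 10.0) -> list[str]:
    """Return the names of visible NVIDIA GPUs, or [] if none can be detected."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # driver tools not installed: treat the host as CPU-only
    try:
        proc = subprocess.run(
            [exe, "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if proc.returncode != 0:
        return []
    return parse_gpu_names(proc.stdout)
```

An empty list here would let the validation loop fall back to the existing CPU path instead of failing.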

Missing Capability Matrix

| Capability | Status |
| --- | --- |
| Remote SSH connection to GPU server | ❌ Missing |
| Docker containerization per experiment | ❌ Missing |
| Cloud API (AWS/GCP/Azure/Lambda) | ❌ Missing |
| Cluster scheduler (Slurm, Ray, SageMaker) | ❌ Missing |
| GPU detection & allocation (CUDA) | ❌ Missing |
| Auto environment setup (requirements.txt / conda env) | ❌ Missing |
| Spot/preemptible instance support | ❌ Missing |

A Key Contradiction

agents/paper_idea_agent.py and paradigm_agent.py generate experimental plans that estimate GPU requirements:

"compute_budget": {
  "gpu_type": "A100-80GB",
  "total_gpu_hours": "Estimate",
  "estimated_cost": "$X at cloud rates"
}

However, the system has no mechanism to actually provision or schedule those GPUs. The plan text mentions cloud costs, but the execution layer cannot spend a single dollar or rent a single GPU.
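One way to surface this gap at runtime would be a small gate that compares the plan's `compute_budget.gpu_type` with what the active backend can offer. This is only a sketch: `unmet_gpu_request` and the `available_gpu_types` argument are hypothetical, and the plan is assumed to be a parsed dict with the shape shown above.

```python
from typing import Optional


def unmet_gpu_request(plan: dict, available_gpu_types: set[str]) -> Optional[str]:
    """Return the requested GPU type if the backend cannot provide it, else None."""
    budget = plan.get("compute_budget") or {}
    requested = budget.get("gpu_type")
    if not requested:
        return None  # the plan does not ask for a GPU
    if requested in available_gpu_types:
        return None  # the backend can honor the request
    return requested


plan = {"compute_budget": {"gpu_type": "A100-80GB"}}
# With today's local-only execution, nothing is available:
print(unmet_gpu_request(plan, available_gpu_types=set()))
```

With the current local-only execution layer the set of available types is always empty, so every plan that names a GPU would be flagged.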

Impact

Because experiments are forced to run locally on CPU within a 5-minute timeout:

  • Real deep-learning experiments are impossible. Anything involving PyTorch, transformers, or distributed training cannot execute.
  • Results are proxy-only. The validation loop tests simplified toy versions of hypotheses, not the actual proposed methods.
  • Compute cost optimization is moot. The system estimates GPU-hours but cannot leverage cloud spot instances or auto-scaling.
  • The closed loop is incomplete. A genuine autonomous scientist must be able to run the experiments it designs, not just simulate them.

Proposed Direction

Abstract the execution layer with an ExperimentBackend protocol:

  1. LocalBackend: Preserve current behavior (subprocess on local machine).
  2. RemoteBackend: SSH into a GPU server, rsync code, run experiment, fetch results.
  3. CloudBackend: Integrate with a cloud provider (e.g., RunPod, Vast.ai, AWS EC2/GCP Compute) to:
    • Spin up a container matching compute_budget.gpu_type.
    • Auto-generate Dockerfile + requirements.txt from the experiment scaffold.
    • Stream logs back via WebSocket or polled API.
    • Terminate instance on completion to minimize cost.
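A rough sketch of what the protocol and the first two backends could look like. Everything here is an assumption for discussion: the `ExperimentResult` fields, the `run` signature, the `build_remote_commands` helper, and the `results/` directory convention are hypothetical and not taken from the existing code. The remote part only builds the rsync/ssh command lines and does not execute them.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ExperimentResult:
    returncode: int
    stdout: str
    stderr: str
    timed_out: bool = False


class ExperimentBackend(Protocol):
    """Anything that can run a training script and report the outcome."""

    def run(self, train_script: Path, workdir: Path, timeout_s: float) -> ExperimentResult:
        ...


def _as_text(data) -> str:
    """Normalize captured output, which may be None, bytes, or str."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


class LocalBackend:
    """Preserves the current behavior: a subprocess on the local machine."""

    def __init__(self, python: str = sys.executable) -> None:
        self.python = python

    def run(self, train_script: Path, workdir: Path, timeout_s: float) -> ExperimentResult:
        try:
            proc = subprocess.run(
                [self.python, str(train_script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired as exc:
            return ExperimentResult(-1, _as_text(exc.stdout), _as_text(exc.stderr), timed_out=True)
        return ExperimentResult(proc.returncode, proc.stdout, proc.stderr)


def build_remote_commands(
    host: str, local_dir: Path, remote_dir: str, script_name: str, python: str = "python3"
) -> list[list[str]]:
    """Build (but do not run) the upload, execute, and fetch steps for a RemoteBackend."""
    remote_cmd = f"cd {shlex.quote(remote_dir)} && {shlex.quote(python)} {shlex.quote(script_name)}"
    return [
        ["rsync", "-az", f"{local_dir}/", f"{host}:{remote_dir}/"],  # push code
        ["ssh", host, remote_cmd],  # run the experiment
        ["rsync", "-az", f"{host}:{remote_dir}/results/", f"{local_dir}/results/"],  # pull results
    ]
```

Because `ExperimentBackend` is a `typing.Protocol`, `LocalBackend` satisfies it structurally without inheriting from it, so the validation loop could accept any backend with a matching `run` method.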

Configuration should be environment-driven (e.g., SCIFORGE_BACKEND=cloud, SCIFORGE_CLOUD_PROVIDER=runpod, SCIFORGE_API_KEY=...).
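A possible shape for reading that configuration, as a sketch only. The variable names come from the examples above; the `BackendConfig` class, the `load_backend_config` function, the `local` default, and the `remote` value are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# "local" and "remote" are assumed values; only "cloud" appears in the proposal.
VALID_BACKENDS = ("local", "remote", "cloud")


@dataclass(frozen=True)
class BackendConfig:
    backend: str
    cloud_provider: Optional[str] = None
    api_key: Optional[str] = None


def load_backend_config(env: Mapping[str, str] = os.environ) -> BackendConfig:
    """Read backend settings from the environment, defaulting to local execution."""
    backend = env.get("SCIFORGE_BACKEND", "local").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"Unknown SCIFORGE_BACKEND: {backend!r}")
    provider = env.get("SCIFORGE_CLOUD_PROVIDER")
    api_key = env.get("SCIFORGE_API_KEY")
    if backend == "cloud" and not (provider and api_key):
        raise ValueError(
            "SCIFORGE_BACKEND=cloud requires SCIFORGE_CLOUD_PROVIDER and SCIFORGE_API_KEY"
        )
    return BackendConfig(backend=backend, cloud_provider=provider, api_key=api_key)
```

Defaulting to `local` would keep existing deployments working unchanged when none of the variables are set.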


Without this, DeepGraph remains a "hypothesis generator" rather than a true "autonomous experimental scientist".
