[Feature Request] Cloud GPU & Remote Experiment Execution Backend

## Problem

The SciForge validation loop is hard-coded to run experiments locally via `subprocess.run(["python3.12", train_script], ...)` on the same machine as the Flask server. This means the system cannot execute real deep-learning experiments that require GPUs, nor can it scale beyond a single local CPU.

## Evidence from Codebase

- **Local-only execution**: `agents/validation_loop.py:88-95` directly invokes `python3.12 train.py` in a local subprocess with no abstraction for remote backends.
- **Explicit CPU fallback**: `agents/experiment_forge.py:78` states the bootstrap `train.py` must be "self-contained (only uses stdlib + numpy + scipy, no GPU required)".
- **Zero cloud/remote infrastructure**: No references to SSH, Docker, Kubernetes, Slurm, AWS, GCP, Azure, or any container orchestration in `agents/` or `orchestrator/`.
- **No GPU detection**: No calls to `nvidia-smi`, no CUDA availability checks, no GPU allocation logic anywhere.

## Missing Capability Matrix

| Capability | Status |
|---|---|
| Remote SSH connection to GPU server | ❌ Missing |
| Docker containerization per experiment | ❌ Missing |
| Cloud API (AWS/GCP/Azure/Lambda) | ❌ Missing |
| Cluster scheduler (Slurm, Ray, SageMaker) | ❌ Missing |
| GPU detection & allocation (CUDA) | ❌ Missing |
| Auto environment setup (requirements.txt / conda env) | ❌ Missing |
| Spot/preemptible instance support | ❌ Missing |

## A Key Contradiction

`agents/paper_idea_agent.py` and `paradigm_agent.py` generate experimental plans that *estimate* GPU requirements:

```json
"compute_budget": {
  "gpu_type": "A100-80GB",
  "total_gpu_hours": "Estimate",
  "estimated_cost": "$X at cloud rates"
}
```

However, the system has **no mechanism** to actually provision or schedule those GPUs. The plan text mentions cloud costs, but the execution layer cannot spend a single dollar or rent a single GPU.

## Impact

Because experiments are forced to run locally on CPU within a 5-minute timeout:
- **Real deep-learning experiments are impossible**. Anything involving PyTorch, transformers, or distributed training cannot execute.
- **Results are proxy-only**. The validation loop tests simplified toy versions of hypotheses, not the actual proposed methods.
- **Compute cost optimization is moot**. The system estimates GPU-hours but cannot leverage cloud spot instances or auto-scaling.
- **The closed loop is incomplete**. A genuine autonomous scientist must be able to run the experiments it designs, not just simulate them.

## Proposed Direction

Abstract the execution layer with a `ExperimentBackend` protocol:

1. **LocalBackend**: Preserve current behavior (subprocess on local machine).
2. **RemoteBackend**: SSH into a GPU server, rsync code, run experiment, fetch results.
3. **CloudBackend**: Integrate with a cloud provider (e.g., RunPod, Vast.ai, AWS EC2/GCP Compute) to:
   - Spin up a container matching `compute_budget.gpu_type`.
   - Auto-generate `Dockerfile` + `requirements.txt` from the experiment scaffold.
   - Stream logs back via WebSocket or polled API.
   - Terminate instance on completion to minimize cost.

Configuration should be environment-driven (e.g., `SCIFORGE_BACKEND=cloud`, `SCIFORGE_CLOUD_PROVIDER=runpod`, `SCIFORGE_API_KEY=...`).

---
*Without this, DeepGraph remains a "hypothesis generator" rather than a true "autonomous experimental scientist".*


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] Cloud GPU & Remote Experiment Execution Backend #7

Problem

Evidence from Codebase

Missing Capability Matrix

A Key Contradiction

Impact

Proposed Direction

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Capability	Status
Remote SSH connection to GPU server	❌ Missing
Docker containerization per experiment	❌ Missing
Cloud API (AWS/GCP/Azure/Lambda)	❌ Missing
Cluster scheduler (Slurm, Ray, SageMaker)	❌ Missing
GPU detection & allocation (CUDA)	❌ Missing
Auto environment setup (requirements.txt / conda env)	❌ Missing
Spot/preemptible instance support	❌ Missing

[Feature Request] Cloud GPU & Remote Experiment Execution Backend #7

Description

Problem

Evidence from Codebase

Missing Capability Matrix

A Key Contradiction

Impact

Proposed Direction

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions