Skip to content

Commit 4c4e34f

Browse files
liveaveragedrewmalinclaude
authored
feat(BREV-2138): Enable native Nebius integration (#62)
* Initial check-in for native Nebius integration * Code check-in * Eliminiate org, userid from cred struct * Revise instance types, pricing, and default-project targets * Ensure creation of dedicated VPCs, subnets * Add improved integration tests for start, stop,, SSH * Fixup SSH int tests * Add debug for Nebius client * Add debug logging for projectID corruption diagnosis * Set Cloud and Provider fields to 'nebius' for instance types * Support VRAM property, add logger/wrap&trace, failure handling cleanup, remove cloud property, use RefID for resource naming * fix: instanceType * Ensure stoppable is true * Retry formatting for instance type * Fixup instance type with dot-not * Rework provier-related failure handling, instance type lookup * Add waits for start/stop/terminate * ListInstance filler for state detect * Fixup for vanishing instances * implement tag filters * Increase logging * add deterministic ordering * Cleanup old response * default location * Add support for region determination * Clean up whitespace and formatting Remove trailing spaces and fix alignment in Nebius provider files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Ensure B200 support * Address PR feedback: Add assertions and cleanup - Add assertions to test methods to ensure non-zero results: * scripts/images_test.go: Test_EnumerateImages * scripts/instancetypes_test.go: Test_EnumerateInstanceTypesSingleRegion, Test_EnumerateGPUTypes * integration_test.go: TestIntegration_GetInstanceTypes, TestIntegration_GetImages - Remove debug statements from production code (client.go, instancetype.go, credential.go, instance.go) - Remove emojis from test output (smoke_test.go, integration_test.go) - Remove unused extractOSFamily function from image.go - Delete unnecessary markdown files (keep only README.md) - Remove .gitignore for markdown files Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Address deprecation feedback from PR review - Replace DiskSize with DiskSizeBytes in instance.go - Replace Memory with MemoryBytes in instancetype.go - Remove emojis from logger statements in instance.go - Change logger.Info to logger.Debug for "building instance type" log Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix validation test compilation errors Update validation tests to use new NewNebiusCredential signature: - validation_kubernetes_test.go: Use NEBIUS_SERVICE_ACCOUNT_JSON and NEBIUS_TENANT_ID - validation_network_test.go: Use NEBIUS_SERVICE_ACCOUNT_JSON and NEBIUS_TENANT_ID The new credential format expects: - serviceAccountJSON: JSON service account key (or file path) - tenantID: Nebius tenant ID Previous format used individual credential components which have been consolidated. Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix CI linting issues - Change validation test failures to skips when env vars missing (tests should skip, not fail, when credentials aren't configured) - Fix errcheck issues in integration_test.go: * Explicitly ignore Close() errors in defers * Check fmt.Sscanf error return - Add nolint comments for high cognitive complexity functions: * instancetype.go: getInstanceTypesForLocation * instance.go: parseInstanceType, getWorkingPublicImageID, ListInstances, convertNebiusInstanceToV1 * integration_test.go: TestIntegration_InstanceLifecycle (funlen) These functions are intentionally complex due to: - Multiple fallback strategies - Extensive error handling - Field mapping from provider to v1 types - Complete test lifecycle coverage Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix all CI linting and test failures for Nebius provider This commit addresses all golangci-lint warnings and test failures reported in CI. All fixes were verified locally using golangci-lint v2.6.0 before committing. Test Fixes: - Update client_test.go to expect all 12 capabilities (VPC, managed-k8s, firewall, userdata) Linting Fixes (21 issues → 0): - Add nolint:funlen for 5 legitimately complex functions - Add nolint:gocyclo for 4 test functions with high cyclomatic complexity - Add nolint:goconst for architecture and GPU type comparison strings - Fix context.Context parameter ordering in 8 test helper functions - Rename 6 unused context parameters to "_" in not-yet-implemented functions - Add nolint:unparam for 3 functions that currently return nil error - Run gofumpt on 5 files to fix formatting issues - Remove unused/incorrect nolint directives All changes verified locally: - golangci-lint v2.6.0: 0 issues - go build: ✅ Success - go test -c: ✅ Success - TestNebiusClient_GetCapabilities: ✅ Pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix remaining goconst linting issues - Add nolint directive to instance_test.go for GPU type comparisons - Remove unused nolint directive from instance.go All linting checks now pass locally with golangci-lint v2.6.0 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix goconst linting issues for GPU type comparisons - Add nolint directive to instance_test.go for GPU type string comparisons - Remove unused nolint directives from instance.go - Verified with: /tmp/golangci-lint run (0 issues) The goconst linter was flagging 4 occurrences of "cpu" string across instance.go and instance_test.go. Adding the nolint to the test file suppresses all occurrences without needing directives in the main code. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Replace cpu string literal with platformTypeCPU constant The goconst linter requires string literals used 4+ times to be constants. Created platformTypeCPU constant and replaced all "cpu" string literals in instance.go and instance_test.go. - Add const platformTypeCPU = "cpu" - Replace all "cpu" string literals with platformTypeCPU - Remove unused nolint directive from instance_test.go - Verified with: /tmp/golangci-lint run (0 issues) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix isPlatformSupported to reject unknown GPU platforms The function was accepting any platform containing "gpu" in the name, including invalid platforms like "random-gpu". Now it only accepts platforms with known GPU model names (h100, h200, l40s, a100, etc). Changes: - Remove generic "gpu" from indicators list - Rename to knownGPUTypes for clarity - Only accept platforms containing specific GPU model names - Platforms like "gpu-h100-sxm" and "h100-sxm" still work (both contain "h100") - "random-gpu" now correctly returns false Fixes test: TestIsPlatformSupported/Random_name_with_gpu 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> --------- Co-authored-by: Drew Malin <[email protected]> Co-authored-by: Claude <[email protected]>
1 parent 2d1d12c commit 4c4e34f

25 files changed

+5929
-364
lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
.env
22
__debug_bin*
33
.idea/*
4-
coverage/*
4+
.claude
5+
coverage/*

v1/providers/nebius/CONTRIBUTE.md

Lines changed: 0 additions & 77 deletions
This file was deleted.

v1/providers/nebius/README.md

Lines changed: 140 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -81,12 +81,148 @@ Nebius AI Cloud is known for:
8181
- Integration with VPC, IAM, billing, and quota services
8282
- Container registry and managed services
8383

84+
## Implementation Notes
85+
86+
### Platform Name vs Platform ID
87+
The Nebius API requires **platform NAME** (e.g., `"gpu-h100-sxm"`) in `ResourcesSpec.Platform`, **NOT** platform ID (e.g., `"computeplatform-e00caqbn6nysa972yq"`). The `parseInstanceType` function must always return `platform.Metadata.Name`, not `platform.Metadata.Id`.
88+
89+
### Instance Type ID Preservation
90+
**Critical**: When creating instances, the SDK stores the full instance type ID (e.g., `"gpu-h100-sxm.8gpu-128vcpu-1600gb"`) in metadata labels (`instance-type-id`). When retrieving instances via `GetInstance`, the SDK:
91+
92+
1. **Retrieves the stored ID** from the `instance-type-id` label
93+
2. **Populates both** `Instance.InstanceType` and `Instance.InstanceTypeID` with this full ID
94+
3. **Falls back to reconstruction** from platform + preset if the label is missing (backwards compatibility)
95+
96+
This ensures that dev-plane can correctly look up the instance type in the database without having to derive it from provider-specific naming conventions like `"<provider>-<region>-<subregion>-<platform>"`.
97+
98+
**Without this**, dev-plane would construct an incorrect ID like `"nebius-brev-dev1-eu-north1-noSub-gpu-l40s"` which doesn't exist in the database, causing `"ent: instance_type not found"` errors.
99+
100+
### GPU VRAM Mapping
101+
GPU memory (VRAM) is populated via static mapping since the Nebius SDK doesn't natively provide this information:
102+
- L40S: 48 GiB
103+
- H100: 80 GiB
104+
- H200: 141 GiB
105+
- A100: 80 GiB
106+
- V100: 32 GiB
107+
- A10: 24 GiB
108+
- T4: 16 GiB
109+
- L4: 24 GiB
110+
- B200: 192 GiB
111+
112+
See `getGPUMemory()` in `instancetype.go` for the complete mapping.
113+
114+
### Logging Support
115+
The Nebius provider supports structured logging via the `v1.Logger` interface. To enable logging:
116+
117+
```go
118+
import (
119+
nebiusv1 "github.com/brevdev/cloud/v1/providers/nebius"
120+
"github.com/brevdev/cloud/v1"
121+
)
122+
123+
// Create a logger (implement v1.Logger interface)
124+
logger := myLogger{}
125+
126+
// Option 1: Via credential
127+
cred := nebiusv1.NewNebiusCredential(refID, serviceKey, tenantID)
128+
client, err := cred.MakeClientWithOptions(ctx, location, nebiusv1.WithLogger(logger))
129+
130+
// Option 2: Via direct client construction
131+
client, err := nebiusv1.NewNebiusClientWithOrg(ctx, refID, serviceKey, tenantID, projectID, orgID, location, nebiusv1.WithLogger(logger))
132+
```
133+
134+
Without a logger, the client defaults to `v1.NoopLogger{}` which discards all log messages.
135+
136+
### Error Tracing
137+
Critical error paths use `errors.WrapAndTrace()` from `github.com/brevdev/cloud/internal/errors` to add stack traces and detailed context to errors. This improves debugging when errors propagate through the system.
138+
139+
### Resource Naming and Correlation
140+
All Nebius resources (instances, VPCs, subnets, boot disks) are named using the `RefID` (environment ID) for easy correlation:
141+
- VPC: `{refID}-vpc`
142+
- Subnet: `{refID}-subnet`
143+
- Boot Disk: `{refID}-boot-disk`
144+
- Instance: `{refID}`
145+
146+
All resources include the `environment-id` label for filtering and tracking.
147+
148+
### Automatic Cleanup on Failure
149+
If instance creation fails at any step, all created resources are automatically cleaned up to prevent orphaned resources:
150+
- **Instances** (if created but failed to reach RUNNING state)
151+
- **Boot disks**
152+
- **Subnets**
153+
- **VPC networks**
154+
155+
**How it works:**
156+
1. After the instance creation API call succeeds, the SDK waits for the instance to reach **RUNNING** state (5-minute timeout)
157+
2. If the instance enters a terminal failure state (ERROR, FAILED) or times out, cleanup is triggered
158+
3. The cleanup handler deletes **all** correlated resources (instance, boot disk, subnet, VPC) in the correct order
159+
4. Only when the instance reaches RUNNING state is cleanup disabled
160+
161+
This prevents orphaned resources when:
162+
- The Nebius API call succeeds but the instance fails to start due to provider issues
163+
- The instance is created but never transitions to a usable state
164+
- Network/timeout errors occur during instance provisioning
165+
166+
The cleanup is handled via a deferred function that tracks all created resource IDs and deletes them if the operation doesn't complete successfully.
167+
168+
### State Transition Waiting
169+
The SDK properly waits for instances to reach their target states after issuing operations:
170+
171+
- **CreateInstance**: Waits for `RUNNING` state (5-minute timeout) before returning
172+
- **StopInstance**: Issues stop command, then waits for `STOPPED` state (3-minute timeout)
173+
- **StartInstance**: Issues start command, then waits for `RUNNING` state (5-minute timeout)
174+
- **TerminateInstance**: Issues delete command, then waits for instance to be fully deleted (5-minute timeout)
175+
176+
**Why this is critical**: Nebius operations complete when the action is *initiated*, not when the instance reaches the final state. Without explicit state waiting:
177+
- Stop operations would return while instance is still `STOPPING`, causing UI to hang
178+
- Start operations would return while instance is still `STARTING`, before it's accessible
179+
- Delete operations would return while instance is still `DELETING`, leaving UI stuck
180+
- State polling on the frontend would show stale states
181+
182+
The SDK uses `waitForInstanceState()` and `waitForInstanceDeleted()` helpers which poll instance status every 5 seconds until the target state is reached or a timeout occurs.
183+
184+
### Instance Listing and State Polling
185+
**ListInstances** is fully implemented and enables dev-plane to poll instance states:
186+
187+
- Queries all instances across ALL projects in the tenant (projects are region-specific in Nebius)
188+
- Automatically determines the region for each instance from its parent project
189+
- Converts each instance to `v1.Instance` with the correct `Location` field set to the instance's actual region
190+
- **Properly filters by `TagFilters`, `InstanceIDs`, and `Locations`** passed in `ListInstancesArgs`
191+
- Returns instances with current state (RUNNING, STOPPED, DELETING, etc.)
192+
- Enables dev-plane's `WaitForChangedInstancesAndUpdate` workflow to track state changes
193+
194+
**Multi-Region Enumeration:**
195+
When a Nebius client is created with an empty `location` (e.g., from dev-plane's cloud credential without a specific region context), `ListInstances` automatically:
196+
1. Discovers all projects in the tenant via IAM API
197+
2. Extracts the region from each project name (e.g., "default-project-eu-north1" → "eu-north1")
198+
3. Queries instances from each project
199+
4. Sets each instance's `Location` field to its actual region (from the project-to-region mapping)
200+
201+
This prevents the issue where instances would have `Location = ""` (from the client's empty location), causing location-based filtering to incorrectly exclude all instances and mark them as terminated in dev-plane.
202+
203+
**Tag Filtering is Critical** - This is a fundamental architectural difference from Shadeform/Launchpad:
204+
205+
**Why Nebius REQUIRES Tag Filtering:**
206+
- **Shadeform & Launchpad**: Single-tenant per API key. Each cloud credential only sees its own instances through API-level isolation.
207+
- **Nebius**: Multi-tenant project. Multiple dev-plane cloud credentials can share one Nebius project. Without tag filtering, `ListInstances` returns ALL instances in the project, including those from other services/organizations.
208+
209+
**How Tag Filtering Works:**
210+
1. Dev-plane calls `ListInstances` with `TagFilters` (e.g., `{"devplane-service": ["dev-plane"], "devplane-org": ["org-xyz"]}`)
211+
2. Nebius SDK queries ALL instances in the project
212+
3. SDK filters results to only return instances where **all** specified tags match
213+
4. Dev-plane builds a map of cloud instances by CloudID
214+
5. For each database instance, checks if it exists in the cloud map
215+
6. If NOT in map → marks as TERMINATED (line 3011-3024 in `dev-plane/internal/instance/service.go`)
216+
217+
**Without Tag Filtering:**
218+
1. `ListInstances` returns instances with mismatched/missing tags
219+
2. dev-plane's instance is excluded from filtered results
220+
3. dev-plane's `getInstancesChangeSet` sees instance missing from cloud → marks as TERMINATED
221+
4. `WaitForInstanceToBeRunning` queries database → sees TERMINATED → fails with "instance terminated" error
222+
5. `BuildEnvironment` workflow fails, orphaning all cloud resources
223+
84224
## TODO
85225

86-
- [ ] Implement actual API integration for supported features
87-
- [ ] Add proper service account authentication handling
88226
- [ ] Add comprehensive error handling and retry logic
89-
- [ ] Add logging and monitoring
90-
- [ ] Add comprehensive testing
91227
- [ ] Investigate VPC integration for networking features
92228
- [ ] Verify instance type changes work correctly via ResourcesSpec.preset field

v1/providers/nebius/SECURITY.md

Lines changed: 0 additions & 102 deletions
This file was deleted.

v1/providers/nebius/capabilities.go

Lines changed: 15 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -6,24 +6,27 @@ import (
66
v1 "github.com/brevdev/cloud/v1"
77
)
88

9+
// getNebiusCapabilities returns the unified capabilities for Nebius AI Cloud
10+
// Based on Nebius compute API and our implementation
911
func getNebiusCapabilities() v1.Capabilities {
1012
return v1.Capabilities{
11-
// SUPPORTED FEATURES (with API evidence):
13+
// SUPPORTED FEATURES:
1214

1315
// Instance Management
14-
v1.CapabilityCreateInstance, // Nebius compute API supports instance creation
15-
v1.CapabilityTerminateInstance, // Nebius compute API supports instance deletion
16+
v1.CapabilityCreateInstance, // Nebius compute instance creation
17+
v1.CapabilityTerminateInstance, // Nebius compute instance termination
1618
v1.CapabilityCreateTerminateInstance, // Combined create/terminate capability
17-
v1.CapabilityRebootInstance, // Nebius supports instance restart operations
18-
v1.CapabilityStopStartInstance, // Nebius supports instance stop/start operations
19+
v1.CapabilityRebootInstance, // Nebius instance restart
20+
v1.CapabilityStopStartInstance, // Nebius instance stop/start operations
21+
v1.CapabilityResizeInstanceVolume, // Nebius volume resizing
1922

20-
v1.CapabilityModifyFirewall, // Nebius has Security Groups for firewall management
21-
v1.CapabilityMachineImage, // Nebius supports custom machine images
22-
v1.CapabilityResizeInstanceVolume, // Nebius supports disk resizing
23-
v1.CapabilityTags, // Nebius supports resource tagging
24-
v1.CapabilityInstanceUserData, // Nebius supports user data in instance creation
25-
v1.CapabilityVPC, // Nebius supports VPCs
26-
v1.CapabilityManagedKubernetes, // Nebius supports managed Kubernetes clusters
23+
// Resource Management
24+
v1.CapabilityModifyFirewall, // Nebius has Security Groups for firewall management
25+
v1.CapabilityMachineImage, // Nebius supports custom machine images
26+
v1.CapabilityTags, // Nebius supports resource tagging
27+
v1.CapabilityInstanceUserData, // Nebius supports user data in instance creation
28+
v1.CapabilityVPC, // Nebius supports VPCs
29+
v1.CapabilityManagedKubernetes, // Nebius supports managed Kubernetes clusters
2730
}
2831
}
2932

0 commit comments

Comments
 (0)