Deep dive into Mock NVML's design and implementation.
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 USER SPACE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│     ┌──────────────────┐     ┌─────────────────────────────────────┐        │
│     │    nvidia-smi    │     │          Your Application           │        │
│     │  (real binary)   │     │   (k8s-device-plugin, dcgm, etc)    │        │
│     └────────┬─────────┘     └──────────────┬──────────────────────┘        │
│              │                              │                               │
│              │  dlopen("libnvidia-ml.so")   │                               │
│              ▼                              ▼                               │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │                       libnvidia-ml.so (MOCK)                       │    │
│   │                                                                    │    │
│   │ ┌────────────────────────────────────────────────────────────────┐ │    │
│   │ │                        CGo Bridge Layer                        │ │    │
│   │ │  - 400 C function exports (//export directives)                │ │    │
│   │ │  - C struct definitions (nvmlPciInfo_t, nvmlMemory_t, etc)     │ │    │
│   │ │  - Type conversions (C ↔ Go)                                   │ │    │
│   │ └────────────────────────────────────────────────────────────────┘ │    │
│   │                              │                                     │    │
│   │ ┌────────────────────────────▼───────────────────────────────────┐ │    │
│   │ │                          Engine Layer                          │ │    │
│   │ │  - Singleton lifecycle management                              │ │    │
│   │ │  - Configuration loading (YAML or env vars)                    │ │    │
│   │ │  - Handle table (C pointer ↔ Go object mapping)                │ │    │
│   │ └────────────────────────────────────────────────────────────────┘ │    │
│   │                              │                                     │    │
│   │ ┌────────────────────────────▼───────────────────────────────────┐ │    │
│   │ │                       ConfigurableDevice                       │ │    │
│   │ │  - 89 NVML method implementations                              │ │    │
│   │ │  - YAML-driven property values                                 │ │    │
│   │ │  - Wraps dgxa100.Device (go-nvml mock)                         │ │    │
│   │ └────────────────────────────────────────────────────────────────┘ │    │
│   └────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Directory: bridge/ (multiple files with IDE support)
The bridge exposes NVML functions as C symbols that applications can dynamically load. The bridge is organized into hand-written implementation files plus auto-generated stubs:
| File | Purpose |
|---|---|
| cgo_types.go | Shared CGo type definitions (C structs, constants) |
| helpers.go | Helper functions (toReturn, goStringToC, stubReturn) + main() |
| init.go | Initialization: nvmlInit_v2, nvmlShutdown, etc. |
| device.go | Device handles: nvmlDeviceGetCount, GetHandleByIndex, GetName, etc. |
| system.go | System functions: nvmlSystemGetDriverVersion, GetCudaDriverVersion, etc. |
| internal.go | Internal export table for nvidia-smi compatibility |
| stubs_generated.go | Auto-generated stubs for unimplemented functions |
//export nvmlDeviceGetTemperature
func nvmlDeviceGetTemperature(device C.nvmlDevice_t, sensorType C.nvmlTemperatureSensors_t,
	temp *C.uint) C.nvmlReturn_t {
	// 0. Reject a NULL output pointer before it is dereferenced
	if temp == nil {
		return C.NVML_ERROR_INVALID_ARGUMENT
	}
	// 1. Look up Go device from C handle
	dev := engine.GetEngine().LookupConfigurableDevice(uintptr(device))
	if dev == nil {
		return C.NVML_ERROR_INVALID_ARGUMENT
	}
	// 2. Call Go implementation
	temperature, ret := dev.GetTemperature(nvml.TemperatureSensors(sensorType))
	// 3. Convert result to C types
	*temp = C.uint(temperature)
	return toReturn(ret)
}

C Type Definitions (CGo preamble):
typedef struct nvmlPciInfo_st {
    char busIdLegacy[16];
    unsigned int domain;
    unsigned int bus;
    unsigned int device;
    unsigned int pciDeviceId;
    unsigned int pciSubSystemId;
    char busId[32];
} nvmlPciInfo_t;

typedef struct nvmlMemory_st {
    unsigned long long total;
    unsigned long long free;
    unsigned long long used;
} nvmlMemory_t;

File: engine/engine.go (~400 lines)
The Engine is the central coordinator, managing:
- Lifecycle: Init/Shutdown reference counting
- Configuration: Loading from YAML or environment
- Handle mapping: Translating C pointers to Go objects
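The Init/Shutdown reference counting named in the first bullet can be sketched in plain Go. This is a minimal illustration under stated assumptions, not the project's code: `refCountedEngine`, its `ready` flag, and the boolean result of `Shutdown` are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// refCountedEngine is a hypothetical stand-in for the real Engine; only the
// lifecycle counter is modelled here.
type refCountedEngine struct {
	mu        sync.RWMutex
	initCount int
	ready     bool // true while resources are allocated
}

// Init allocates resources on the first call and only bumps the count afterwards.
func (e *refCountedEngine) Init() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.initCount == 0 {
		e.ready = true // assumption: config loading and device creation would happen here
	}
	e.initCount++
}

// Shutdown releases resources when the last outstanding Init is matched.
// It reports false when called without a matching Init.
func (e *refCountedEngine) Shutdown() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.initCount == 0 {
		return false
	}
	e.initCount--
	if e.initCount == 0 {
		e.ready = false
	}
	return true
}

// Ready is a read-only query, so it takes the read lock.
func (e *refCountedEngine) Ready() bool {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.ready
}

func main() {
	e := &refCountedEngine{}
	e.Init()
	e.Init()
	e.Shutdown()
	fmt.Println("ready after first Shutdown:", e.Ready())
	e.Shutdown()
	fmt.Println("ready after second Shutdown:", e.Ready())
}
```

The point of the counter is that two independent callers in the same process can each pair their own Init with a Shutdown without tearing down state the other still uses.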
type Engine struct {
	server    *MockServer  // Device provider
	config    *Config      // Loaded configuration
	handles   *HandleTable // C↔Go handle mapping
	initCount int          // Reference count
	mu        sync.RWMutex // Thread safety
}

Singleton Pattern:
var (
	engineInstance *Engine
	engineOnce     sync.Once
)

func GetEngine() *Engine {
	engineOnce.Do(func() {
		engineInstance = NewEngine(nil)
	})
	return engineInstance
}

File: engine/handles.go (~170 lines)
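Problem: CGo's pointer-passing rules forbid handing C a Go pointer to memory that itself contains Go pointers, and C code may not keep a Go pointer after the call returns. nvidia-smi, however, holds on to the device handle it receives and expects to be able to dereference it.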
Solution: Allocate real C memory blocks that nvidia-smi can safely access.
// C structure that nvidia-smi can dereference
typedef struct {
    unsigned int magic;   // 0x4E564D4C ("NVML")
    unsigned int index;   // Device index
    void* reserved[4];    // Space nvidia-smi might read
} HandleBlock;

func (ht *HandleTable) Register(dev nvml.Device) uintptr {
	// Allocate C memory block
	// (deviceIndex is the index of dev; how it is obtained is not shown in this excerpt)
	cHandle := C.allocHandle(C.uint(deviceIndex))
	handle := uintptr(unsafe.Pointer(cHandle))
	// Store bidirectional mapping
	ht.devices[handle] = dev
	ht.reverse[dev] = handle
	return handle
}

Files: engine/config.go (~350 lines), engine/config_types.go (418 lines)
YAMLConfig:
├── SystemConfig      # Driver version, CUDA version
├── DeviceDefaults    # Default properties for all devices
└── Devices[]         # Per-device overrides
    ├── index: 0
    │   └── (overrides)
    ├── index: 1
    │   └── (overrides)
    └── ...

func (c *Config) GetDeviceConfig(index int) *DeviceConfig {
	// Start with defaults
	merged := c.YAMLConfig.DeviceDefaults
	// Apply per-device overrides
	for _, override := range c.YAMLConfig.Devices {
		if override.Index == index {
			mergeDeviceOverride(&merged, &override)
			break
		}
	}
	return &merged
}

File: engine/device.go (~1290 lines)
Implements 89 NVML methods by reading from YAML configuration.
type ConfigurableDevice struct {
	*dgxa100.Device                 // Base device (embedded)
	config          *DeviceConfig   // YAML configuration
	index           int
	minorNumber     int
	bar1Memory      nvml.BAR1Memory // Cached
	pciInfo         nvml.PciInfo    // Cached
}

Method Implementation Pattern:
func (d *ConfigurableDevice) GetTemperature(sensor nvml.TemperatureSensors) (uint32, nvml.Return) {
	// Check if config provides value
	if d.config != nil && d.config.Thermal != nil {
		return uint32(d.config.Thermal.TemperatureGPU_C), nvml.SUCCESS
	}
	// No config = not supported
	return 0, nvml.ERROR_NOT_SUPPORTED
}

nvidia-smi                Engine                    Config
    │                        │                         │
    │  nvmlInit_v2()         │                         │
    │───────────────────────►│                         │
    │                        │  LoadConfig()           │
    │                        │────────────────────────►│
    │                        │                         │
    │                        │     ┌───────────────────┤
    │                        │     │  YAML exists?     │
    │                        │     └─────────┬─────────┘
    │                        │               │
    │                        │        YES: Parse YAML
    │                        │        NO: Use env vars
    │                        │               │
    │                        │◄──────────────┘
    │                        │
    │                        │  createServer()
    │                        │  - Create dgxa100.Server
    │                        │  - Create ConfigurableDevices
    │                        │  - Apply system config
    │                        │
    │◄───────────────────────│  NVML_SUCCESS

nvidia-smi                Bridge              Engine           Device
    │                        │                   │                │
    │ GetTemperature(dev,0,&t)                   │                │
    │───────────────────────►│                   │                │
    │                        │ LookupDevice(dev) │                │
    │                        │──────────────────►│                │
    │                        │                   │ Lookup(handle) │
    │                        │◄──────────────────│                │
    │                        │                   │                │
    │                        │ GetTemperature(0) │                │
    │                        │───────────────────┼───────────────►│
    │                        │                   │                │
    │                        │                   │  config.Thermal│
    │                        │                   │  .TempGPU_C    │
    │                        │◄──────────────────┼────────────────│
    │                        │                   │  33, SUCCESS   │
    │◄───────────────────────│                   │                │
    │  temp=33               │                   │                │
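
The request path in the diagram above (handle lookup, device call, config read, result written back) can be traced in a pure-Go sketch. All types here (`returnCode`, `thermalConfig`, `device`, `engine`) are simplified, hypothetical stand-ins; the real bridge is a CGo export and the real device wraps dgxa100.Device. The sketch also only writes the output on success, which is a choice made for the example.

```go
package main

import "fmt"

// returnCode is a hypothetical stand-in for nvml.Return.
type returnCode int

const (
	success returnCode = iota
	errInvalidArgument
	errNotSupported
)

type thermalConfig struct{ TemperatureGPU_C int }

type device struct{ thermal *thermalConfig }

// GetTemperature follows the "config value or not supported" rule.
func (d *device) GetTemperature() (uint32, returnCode) {
	if d.thermal == nil {
		return 0, errNotSupported
	}
	return uint32(d.thermal.TemperatureGPU_C), success
}

// engine maps opaque handles to devices.
type engine struct{ devices map[uintptr]*device }

func (e *engine) lookup(h uintptr) *device { return e.devices[h] }

// bridgeGetTemperature plays the role of the exported C function:
// validate the arguments, resolve the handle, call the device, write the result.
func bridgeGetTemperature(e *engine, h uintptr, out *uint32) returnCode {
	if out == nil {
		return errInvalidArgument
	}
	dev := e.lookup(h)
	if dev == nil {
		return errInvalidArgument
	}
	t, ret := dev.GetTemperature()
	if ret == success {
		*out = t
	}
	return ret
}

func main() {
	e := &engine{devices: map[uintptr]*device{
		0x1000: {thermal: &thermalConfig{TemperatureGPU_C: 33}},
		0x2000: {}, // device without a thermal section
	}}
	var temp uint32
	fmt.Println("configured device:", bridgeGetTemperature(e, 0x1000, &temp), temp)
	fmt.Println("no thermal config:", bridgeGetTemperature(e, 0x2000, &temp))
	fmt.Println("unknown handle:", bridgeGetTemperature(e, 0x3000, &temp))
}
```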
| Pattern | Component | Purpose |
|---|---|---|
| Singleton | Engine | Single lifecycle manager |
| Decorator | ConfigurableDevice wraps dgxa100.Device | Extend without modifying |
| Strategy | createDevicesFromYAML vs createDefaultDevices | Runtime behavior selection |
| Handle Table | HandleTable | Safe C↔Go pointer translation |
| Config Merge | mergeDeviceOverride | Defaults + overrides |
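
The Config Merge row can be illustrated with a small defaults-plus-overrides example. This is a sketch under assumptions: `deviceConfig`, `deviceOverride`, and their fields are hypothetical and much smaller than the real YAML structs. Pointer fields are used so that "not set" can be told apart from a zero value.

```go
package main

import "fmt"

// deviceConfig is a hypothetical, trimmed-down config.
type deviceConfig struct {
	Name         string
	MemoryMiB    *uint64
	TemperatureC *int
}

// deviceOverride pairs a device index with the fields it overrides.
type deviceOverride struct {
	Index int
	deviceConfig
}

// mergeOverride copies only the fields that the override actually sets.
func mergeOverride(dst, src *deviceConfig) {
	if src.Name != "" {
		dst.Name = src.Name
	}
	if src.MemoryMiB != nil {
		dst.MemoryMiB = src.MemoryMiB
	}
	if src.TemperatureC != nil {
		dst.TemperatureC = src.TemperatureC
	}
}

// resolve starts from a copy of the defaults and applies the first matching override.
func resolve(defaults deviceConfig, overrides []deviceOverride, index int) deviceConfig {
	merged := defaults // value copy; merging replaces pointers, so defaults stay untouched
	for i := range overrides {
		if overrides[i].Index == index {
			mergeOverride(&merged, &overrides[i].deviceConfig)
			break
		}
	}
	return merged
}

func intPtr(v int) *int          { return &v }
func u64Ptr(v uint64) *uint64    { return &v }

func main() {
	defaults := deviceConfig{Name: "Mock GPU", MemoryMiB: u64Ptr(40960), TemperatureC: intPtr(33)}
	overrides := []deviceOverride{
		{Index: 1, deviceConfig: deviceConfig{TemperatureC: intPtr(71)}},
	}
	for i := 0; i < 2; i++ {
		c := resolve(defaults, overrides, i)
		fmt.Printf("device %d: %s, %d MiB, %d C\n", i, c.Name, *c.MemoryMiB, *c.TemperatureC)
	}
}
```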
pkg/gpu/mocknvml/
├── bridge/
│ ├── cgo_types.go # Shared CGo type definitions
│ ├── helpers.go # Helper functions + main() + go:generate
│ ├── init.go # nvmlInit_v2, nvmlShutdown, etc.
│ ├── device.go # Device handle functions
│ ├── events.go # Event set/wait functions
│ ├── system.go # System functions
│ ├── internal.go # Internal export table (nvidia-smi)
│ ├── nvml_types.h # C type definitions for CGo preamble
│ └── stubs_generated.go # Auto-generated stubs (~289 functions)
├── engine/
│ ├── config.go # Config loading
│ ├── config_types.go # YAML structs
│ ├── device.go # ConfigurableDevice
│ ├── engine.go # Singleton engine
│ ├── handles.go # Handle table
│ ├── invalid_device.go # Invalid device handle sentinel
│ ├── utils.go # Debug logging
│ ├── version.go # NVML version responses
│ └── *_test.go # Unit tests
├── configs/
│ ├── mock-nvml-config-a100.yaml
│ ├── mock-nvml-config-b200.yaml
│ ├── mock-nvml-config-gb200.yaml
│ ├── mock-nvml-config-h100.yaml
│ ├── mock-nvml-config-l40s.yaml
│ └── mock-nvml-config-t4.yaml
├── Dockerfile
├── Makefile
└── README.md
cmd/generate-bridge/
├── main.go # Stub generator (--stats, --validate flags)
├── parser.go # nvml.h prototype parser
└── main_test.go # Generator tests
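
As a rough illustration of what a prototype parser like the one in cmd/generate-bridge has to do, the sketch below extracts function names from `nvmlReturn_t DECLDIR ...` declarations and lists those that still need a stub. The regular expression, the function names `exportedNames` and `needStubs`, and the idea of comparing against a set of hand-written names are assumptions for the example; the actual generator may work differently.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// protoRe matches prototypes of the form "nvmlReturn_t DECLDIR nvmlFoo(".
var protoRe = regexp.MustCompile(`nvmlReturn_t\s+DECLDIR\s+(nvml\w+)\s*\(`)

// exportedNames returns the sorted, de-duplicated function names found in header.
func exportedNames(header string) []string {
	seen := map[string]bool{}
	var names []string
	for _, m := range protoRe.FindAllStringSubmatch(header, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			names = append(names, m[1])
		}
	}
	sort.Strings(names)
	return names
}

// needStubs lists the functions that have no hand-written implementation.
func needStubs(header string, implemented map[string]bool) []string {
	var out []string
	for _, name := range exportedNames(header) {
		if !implemented[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	header := `
nvmlReturn_t DECLDIR nvmlInit_v2(void);
nvmlReturn_t DECLDIR nvmlDeviceGetCount_v2(unsigned int *deviceCount);
nvmlReturn_t DECLDIR nvmlDeviceGetFanSpeed(nvmlDevice_t device, unsigned int *speed);
`
	implemented := map[string]bool{"nvmlInit_v2": true, "nvmlDeviceGetCount_v2": true}
	for _, name := range needStubs(header, implemented) {
		fmt.Printf("stub needed: %s\n", name)
	}
}
```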
All public Engine methods are protected by sync.RWMutex:
- Read operations (DeviceGetCount, LookupDevice): Use RLock
- Write operations (Init, Shutdown, DeviceGetHandleByIndex): Use Lock
The HandleTable also has its own mutex for independent locking.
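
A table with its own lock can be sketched as follows. This is a simplified, pure-Go stand-in and not the project's HandleTable: device names replace nvml.Device, and handles are plain counters rather than addresses of C-allocated blocks.

```go
package main

import (
	"fmt"
	"sync"
)

// handleTable is a hypothetical, simplified handle table with its own mutex.
type handleTable struct {
	mu      sync.RWMutex
	next    uintptr
	devices map[uintptr]string // handle -> device
	reverse map[string]uintptr // device -> handle
}

func newHandleTable() *handleTable {
	return &handleTable{devices: map[uintptr]string{}, reverse: map[string]uintptr{}}
}

// Register returns the existing handle for dev or creates a new one.
func (ht *handleTable) Register(dev string) uintptr {
	ht.mu.Lock()
	defer ht.mu.Unlock()
	if h, ok := ht.reverse[dev]; ok {
		return h
	}
	ht.next++
	ht.devices[ht.next] = dev
	ht.reverse[dev] = ht.next
	return ht.next
}

// Lookup only reads, so concurrent lookups share the read lock.
func (ht *handleTable) Lookup(h uintptr) (string, bool) {
	ht.mu.RLock()
	defer ht.mu.RUnlock()
	dev, ok := ht.devices[h]
	return dev, ok
}

// Clear drops every mapping, as a shutdown would.
func (ht *handleTable) Clear() {
	ht.mu.Lock()
	defer ht.mu.Unlock()
	ht.devices = map[uintptr]string{}
	ht.reverse = map[string]uintptr{}
}

func main() {
	ht := newHandleTable()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ht.Register(fmt.Sprintf("GPU-%d", i%4)) // only four distinct devices
		}(i)
	}
	wg.Wait()
	fmt.Println("registered devices:", len(ht.devices))
	ht.Clear()
}
```

Because the table locks independently, a lookup does not have to wait for an unrelated engine-level write, and duplicate registrations from concurrent callers resolve to the same handle.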
- Handle blocks are allocated via calloc() in CGo
- Freed on Engine.Shutdown() via HandleTable.Clear()
- Each handle is ~40 bytes
- C strings for nvmlErrorString are cached permanently
- Matches real NVML behavior (static strings)
- Prevents memory leaks from repeated allocations
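
The caching idea behind the nvmlErrorString bullets can be shown with a pure-Go analogue: allocate the string for a code once, then hand out the same pointer on every later call. The `errorStringCache` type and the message texts are illustrative assumptions; the real bridge caches C strings, which this sketch does not attempt.

```go
package main

import (
	"fmt"
	"sync"
)

// messages holds illustrative texts; the real strings may differ.
var messages = map[int]string{
	0: "Success",
	1: "Uninitialized",
	2: "Invalid Argument",
}

// errorStringCache hands out one stable pointer per error code.
type errorStringCache struct {
	mu     sync.Mutex
	cached map[int]*string
	allocs int // number of strings created so far
}

// Get returns the cached string for code, creating it on first use.
func (c *errorStringCache) Get(code int) *string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.cached[code]; ok {
		return s
	}
	msg, ok := messages[code]
	if !ok {
		msg = "Unknown Error"
	}
	if c.cached == nil {
		c.cached = map[int]*string{}
	}
	c.cached[code] = &msg
	c.allocs++
	return &msg
}

func main() {
	cache := &errorStringCache{}
	first := cache.Get(2)
	for i := 0; i < 1000; i++ {
		cache.Get(2)
	}
	fmt.Println(*first, "| same pointer:", first == cache.Get(2), "| allocations:", cache.allocs)
}
```

With a C string in place of the Go string, the same structure means a caller polling the error text in a loop never grows the process's memory, and the returned pointer stays valid for the life of the library.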
The CUDA mock follows the same engine/bridge pattern as NVML but at a smaller scale (15 functions vs 400).
See CUDA Mock for full details.
See Development Guide for:
- Adding new NVML function implementations
- Creating custom GPU profiles
- Regenerating the bridge code