Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -82,3 +82,6 @@ coverage-*.html
*.a

hack/python-sdk/openapi-generator-cli-*.jar

# Go build cache directories
.cache/
19 changes: 19 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,15 @@ GO_FMT ?= gofmt
# Use go.mod go version as a single source of truth for the Go version
GO_VERSION := $(shell awk '/^go /{print $$2}' go.mod|head -n1)

# Go build cache configuration
GOCACHE ?= $(shell pwd)/.cache/go-build
GOMODCACHE ?= $(shell pwd)/.cache/go-mod
export GOCACHE
export GOMODCACHE

# Ensure cache directories exist
$(shell mkdir -p $(GOCACHE) $(GOMODCACHE))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The $(shell mkdir -p ...) command is executed every time make is invoked, during the parsing phase. While mkdir -p is idempotent, this is generally not a good practice as it can have side effects and is not explicit. It's better to create directories as part of a target's recipe or as a prerequisite for targets that need them. For example, you could create a .PHONY setup target and add it as a dependency to your build targets.


# Determine Docker build command (use nerdctl if available)
DOCKER_BUILD_CMD ?= docker

Expand All @@ -36,6 +45,9 @@ else
DOCKER_BUILD_CMD = nerdctl
endif

# Enable Docker BuildKit for cache mounts
export DOCKER_BUILDKIT=1

# CRD Options
CRD_OPTIONS ?= "crd:maxDescLen=0"

Expand Down Expand Up @@ -207,6 +219,13 @@ tidy: ## 📦 Run go mod tidy
@$(GO_CMD) mod tidy
@echo "✅ Dependencies cleaned up"

.PHONY: clean-cache
clean-cache: ## 🧹 Clean Go build cache
@echo "🧹 Cleaning Go build cache..."
@rm -rf $(GOCACHE) $(GOMODCACHE)
@$(GO_CMD) clean -cache -modcache
@echo "✅ Cache cleaned"

.PHONY: ci-lint
ci-lint: golangci-lint ## 🔎 Run golangci-lint against code.
@echo "🔎 Running golangci-lint..."
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,15 @@ spec:
serviceAccountName: {{ .Values.modelAgent.serviceAccountName }}
affinity:
{{- toYaml .Values.modelAgent.affinity | nindent 8 }}
{{- if or .Values.modelAgent.gpuNodesOnly .Values.modelAgent.nodeSelector }}
nodeSelector:
{{- if .Values.modelAgent.gpuNodesOnly }}
{{ .Values.modelAgent.gpuNodeLabel.key }}: {{ .Values.modelAgent.gpuNodeLabel.value | quote }}
{{- end }}
{{- if .Values.modelAgent.nodeSelector }}
{{- toYaml .Values.modelAgent.nodeSelector | nindent 8 }}
{{- end }}
{{- end }}
{{- $imagePullSecrets := .Values.modelAgent.imagePullSecrets | default .Values.global.imagePullSecrets }}
{{- if $imagePullSecrets }}
imagePullSecrets:
Expand All @@ -40,7 +47,7 @@ spec:
{{- end }}
containers:
- name: model-agent
image: {{ include "ome.imageWithHub" (dict "values" .Values "repository" .Values.modelAgent.image.repository "tag" .Values.modelAgent.image.tag) }}
image: {{ include "ome.imageWithHub" (dict "values" (merge (dict "global" (dict "hub" (default .Values.global.hub .Values.modelAgent.image.hub))) .Values) "repository" .Values.modelAgent.image.repository "tag" .Values.modelAgent.image.tag) }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This logic for overriding the image hub appears to have two issues:

  1. The default function arguments are swapped. It should be (default .Values.modelAgent.image.hub .Values.global.hub) to prioritize the model-agent specific hub, as documented in values.yaml.
  2. The merge function arguments are also swapped. Helm's merge gives precedence to the right-most dictionary, so .Values is currently overwriting your hub override. The arguments should be swapped to merge the override into .Values.

Here is the corrected line:

        image: {{ include "ome.imageWithHub" (dict "values" (merge .Values (dict "global" (dict "hub" (default .Values.modelAgent.image.hub .Values.global.hub)))) "repository" .Values.modelAgent.image.repository "tag" .Values.modelAgent.image.tag) }}

imagePullPolicy: {{ .Values.modelAgent.image.pullPolicy }}
ports:
- name: metrics
Expand Down
16 changes: 16 additions & 0 deletions charts/ome-resources/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -92,10 +92,26 @@ modelAgent:
priorityClassName: system-node-critical
serviceAccountName: ome-model-agent
image:
# Docker registry hub. If set, overrides global.hub for model-agent.
# If repository contains '/', hub is ignored.
# Leave empty to use global.hub
hub: ""
repository: model-agent
pullPolicy: Always
tag: *defaultVersion

# When enabled, the model agent will only run on nodes with GPU
gpuNodesOnly: false

# GPU node label selector (used when gpuNodesOnly is true)
# Examples:
# For NVIDIA GPU operator: nvidia.com/gpu.present: "true"
# For Nebius: nebius.com/gpu: "true"
# For AWS: node.kubernetes.io/instance-type: g4dn.xlarge
gpuNodeLabel:
key: nvidia.com/gpu.present
value: "true"

nodeSelector: {}

# Additional volumes to mount into the model-agent DaemonSet pods
Expand Down
11 changes: 7 additions & 4 deletions dockerfiles/manager.Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,9 @@ WORKDIR /workspace
COPY go.mod go.mod
COPY go.sum go.sum

# Download dependencies
RUN go mod download
# Download dependencies with Go module cache
RUN --mount=type=cache,target=/go/pkg/mod \
go mod download

# Copy source code
COPY cmd/ cmd/
Expand All @@ -24,8 +25,10 @@ ARG VERSION
ARG GIT_TAG
ARG GIT_COMMIT

# Build the manager binary
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
# Build the manager binary with Go build cache
RUN --mount=type=cache,target=/root/.cache/go-build \
--mount=type=cache,target=/go/pkg/mod \
CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
go build -a -installsuffix cgo \
-ldflags "-X github.com/sgl-project/ome/pkg/version.GitVersion=${GIT_TAG} -X github.com/sgl-project/ome/pkg/version.GitCommit=${GIT_COMMIT}" \
-o manager ./cmd/manager
Expand Down
32 changes: 25 additions & 7 deletions dockerfiles/model-agent.Dockerfile
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
# Configurable base image - must be declared before any FROM statement
# Defaults to Oracle Linux 10 for OCI SDK compatibility
# Can be overridden with --build-arg BASE_IMAGE=ubuntu:22.04
ARG BASE_IMAGE=oraclelinux:10-slim

# Build the model-agent binary
FROM golang:1.24 AS builder

Expand All @@ -12,8 +17,9 @@ WORKDIR /workspace
COPY go.mod go.mod
COPY go.sum go.sum

# Download dependencies
RUN go mod download
# Download dependencies with Go module cache
RUN --mount=type=cache,target=/go/pkg/mod \
go mod download

# Copy source code
COPY cmd/ cmd/
Expand All @@ -24,15 +30,27 @@ ARG VERSION
ARG GIT_TAG
ARG GIT_COMMIT

# Build the model-agent binary
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
# Build the model-agent binary with Go build cache
RUN --mount=type=cache,target=/root/.cache/go-build \
--mount=type=cache,target=/go/pkg/mod \
CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
go build -a -installsuffix cgo \
-ldflags "-X github.com/sgl-project/ome/pkg/version.GitVersion=${GIT_TAG} -X github.com/sgl-project/ome/pkg/version.GitCommit=${GIT_COMMIT}" \
-o model-agent ./cmd/model-agent

# Use Oracle Linux 9 as base image for OCI SDK compatibility
FROM oraclelinux:10-slim
RUN microdnf update -y && microdnf clean all
# Use the base image specified at the top of the file
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

# Install/update packages based on the base image
RUN if [ -f /usr/bin/microdnf ]; then \
microdnf update -y && microdnf clean all; \
elif [ -f /usr/bin/apt-get ]; then \
apt-get update && \
apt-get install -y ca-certificates && \
apt-get upgrade -y && \
apt-get clean && rm -rf /var/lib/apt/lists/*; \
fi

COPY --from=builder /workspace/model-agent /
ENTRYPOINT ["/model-agent"]
11 changes: 7 additions & 4 deletions dockerfiles/multinode-prober.Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,9 @@ WORKDIR /workspace
COPY go.mod go.mod
COPY go.sum go.sum

# Download dependencies
RUN go mod download
# Download dependencies with Go module cache
RUN --mount=type=cache,target=/go/pkg/mod \
go mod download

# Copy source code
COPY cmd/ cmd/
Expand All @@ -24,8 +25,10 @@ ARG VERSION
ARG GIT_TAG
ARG GIT_COMMIT

# Build the multinode-prober binary
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
# Build the multinode-prober binary with Go build cache
RUN --mount=type=cache,target=/root/.cache/go-build \
--mount=type=cache,target=/go/pkg/mod \
CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
go build -a -installsuffix cgo \
-ldflags "-X github.com/sgl-project/ome/pkg/version.GitVersion=${GIT_TAG} -X github.com/sgl-project/ome/pkg/version.GitCommit=${GIT_COMMIT}" \
-o multinode-prober ./cmd/multinode-prober
Expand Down
63 changes: 45 additions & 18 deletions dockerfiles/ome-agent.Dockerfile
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
# Configurable base image - must be declared before any FROM statement
# Defaults to Oracle Linux 10 for OCI SDK compatibility
# Can be overridden with --build-arg BASE_IMAGE=ubuntu:22.04
ARG BASE_IMAGE=oraclelinux:10-slim

# Build the ome-agent binary
FROM golang:1.24 AS builder

Expand All @@ -23,39 +28,61 @@ WORKDIR /workspace
COPY go.mod go.mod
COPY go.sum go.sum

# Download dependencies
RUN go mod download
# Download dependencies with Go module cache
RUN --mount=type=cache,target=/go/pkg/mod \
go mod download

# Copy XET dependencies from other pkg subdirectories
COPY pkg/configutils/ pkg/configutils/
COPY pkg/logging/ pkg/logging/

# Copy XET package for building with better caching
COPY pkg/xet/ pkg/xet/

# Download Rust dependencies with cargo cache
RUN --mount=type=cache,target=/root/.cargo/registry \
--mount=type=cache,target=/root/.cargo/git \
cd pkg/xet && cargo fetch

# Copy source code
# Build the XET library with cargo build cache
RUN --mount=type=cache,target=/root/.cargo/registry \
--mount=type=cache,target=/root/.cargo/git \
cd pkg/xet && \
cargo build --release

# Copy remaining source code
COPY cmd/ cmd/
COPY pkg/ pkg/
COPY internal/ internal/

# Build the XET library first
RUN cd pkg/xet && make build

# Build arguments for version info
ARG VERSION
ARG GIT_TAG
ARG GIT_COMMIT

# Build the ome-agent binary (CGO must be enabled for XET library)
RUN CGO_ENABLED=1 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
RUN --mount=type=cache,target=/root/.cache/go-build \
--mount=type=cache,target=/go/pkg/mod \
CGO_ENABLED=1 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} \
go build -a \
-ldflags "-X github.com/sgl-project/ome/pkg/version.GitVersion=${GIT_TAG} -X github.com/sgl-project/ome/pkg/version.GitCommit=${GIT_COMMIT}" \
-o ome-agent ./cmd/ome-agent

# Use Oracle Linux 9 as base image for OCI SDK compatibility
FROM oraclelinux:10-slim
RUN microdnf update -y && microdnf clean all

# Install runtime dependencies for the XET library
RUN microdnf install -y \
glibc \
libgcc \
libstdc++ \
openssl-libs \
&& microdnf clean all
# Use the base image specified at the top of the file
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

# Install/update packages and runtime dependencies based on the base image
RUN if [ -f /usr/bin/microdnf ]; then \
microdnf update -y && \
microdnf install -y glibc libgcc libstdc++ openssl-libs && \
microdnf clean all; \
elif [ -f /usr/bin/apt-get ]; then \
apt-get update && \
apt-get upgrade -y && \
apt-get install -y libc6 libgcc-s1 libstdc++6 libssl3 && \
apt-get clean && rm -rf /var/lib/apt/lists/*; \
fi

COPY --from=builder /workspace/ome-agent /
COPY config/ome-agent/ome-agent.yaml /
Expand Down
2 changes: 2 additions & 0 deletions pkg/hfutil/hub/repo.go
Original file line number Diff line number Diff line change
Expand Up @@ -288,9 +288,11 @@ func SnapshotDownload(ctx context.Context, config *DownloadConfig) (string, erro
if totalErrors > 0 {
// Collect error details
var errorFiles []string
var errorMessages []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

You've added errorMessages to collect more detailed errors, which is a great improvement for debuggability. However, this new slice is populated but never used later in the function. The function still prints the less informative errorFiles slice and returns a generic error message. Consider using errorMessages to provide more detailed feedback to the user on failure, for example by joining them into the returned error string or logging them.

for _, result := range results {
if result.err != nil {
errorFiles = append(errorFiles, filesToDownload[result.index].Path)
errorMessages = append(errorMessages, fmt.Sprintf("%s: %v", filesToDownload[result.index].Path, result.err))
}
}

Expand Down
Loading