Unable to Access Monitoring Port (Prometheus Metrics) on Kubeflow Trainer Controller Manager #2547

izuku-sds · 2025-03-19T10:57:30Z

What happened?

I am unable to access the monitoring port (Prometheus metrics) of the Kubeflow Trainer Controller Manager running from the master branch. The instructions that worked in v1.9.0 do not work in v2.0: port-forwarding to port 8080 fails with a connection-refused error.

Error:

$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
E0319 15:13:06.174141 3648479 portforward.go:424] "Unhandled Error" err="an error occurred forwarding 8080 -> 8080: error forwarding port 8080 to pod 71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339, uid : failed to execute portforward in network namespace \"/var/run/netns/cni-6a5c29c4-f90d-5175-3aeb-cdbca524d613\": failed to connect to localhost:8080 inside namespace \"71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339\", IPv4: dial tcp4 127.0.0.1:8080: connect: connection refused IPv6 dial tcp6 [::1]:8080: connect: connection refused "
error: lost connection to pod

Setup instructions followed:
Cluster setup:

kind create cluster
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
sleep 120
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=master"

env setup:

conda create --name issue python=3.11
conda activate issue
pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

What did you expect to happen?

Prometheus metrics should be accessible at localhost:8080/metrics.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.1

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:latest

Kubeflow Python SDK version:

$ pip show kubeflow
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/trainer
Author: 
Author-email: The Kubeflow Authors <[email protected]>
License: Apache License
Location: /home/izuku/miniconda3/envs/issue2/lib/python3.11/site-packages
Requires: kubernetes, pydantic
Required-by:

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

izuku-sds · 2025-03-20T13:40:01Z

@tenzen-y Related to the issue above: @milinddethe15 mentioned that metrics are disabled by default. Should I change that?

izuku-sds · 2025-03-21T20:31:53Z

Adding the following args to the Deployment YAML gave access to the monitoring port via port-forwarding:

args:
- "--metrics-bind-address=:8080"
- "--metrics-secure=false"

port forwarding:

$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
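
For anyone who wants to confirm that the forwarded port is actually serving metrics, here is a minimal sketch, run in a second terminal while the port-forward above is active. It assumes curl is installed and that the endpoint returns the usual Prometheus text format (lines starting with "# HELP" or "# TYPE"). The function names looks_like_prometheus_metrics and check_metrics_endpoint are hypothetical helpers written for this comment, not part of Kubeflow Trainer.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are made up for illustration.

# Reads a response body on stdin and succeeds if it contains at least one
# "# HELP" or "# TYPE" line, as found in Prometheus text exposition output.
looks_like_prometheus_metrics() {
  grep -Eq '^# (HELP|TYPE) '
}

# Fetches a URL (defaults to the forwarded port used in this issue) and
# reports whether the body looks like Prometheus metrics.
check_metrics_endpoint() {
  local url="${1:-http://localhost:8080/metrics}"
  local body
  # -f: fail on HTTP errors, -sS: quiet but still print errors, 5s timeout.
  if ! body="$(curl -fsS --max-time 5 "$url")"; then
    echo "request to $url failed" >&2
    return 1
  fi
  if printf '%s\n' "$body" | looks_like_prometheus_metrics; then
    echo "metrics OK at $url"
  else
    echo "response from $url does not look like Prometheus metrics" >&2
    return 1
  fi
}
```

After sourcing the snippet, calling check_metrics_endpoint with no arguments checks http://localhost:8080/metrics. With metrics disabled (the default mentioned above), the request should fail the same way the port-forward did.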

izuku-sds added kind/bug lifecycle/needs-triage labels Mar 19, 2025
