Unable to Access Monitoring Port (Prometheus Metrics) on Kubeflow Trainer Controller Manager #2547

Open
izuku-sds opened this issue Mar 19, 2025 · 2 comments

@izuku-sds

What happened?

I'm unable to access the monitoring port (Prometheus metrics) of the Kubeflow Trainer Controller Manager running from the master branch. The instructions that worked in v1.9.0 no longer work in v2.0: port-forwarding fails with a "connection refused" error.

Error:

$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
E0319 15:13:06.174141 3648479 portforward.go:424] "Unhandled Error" err="an error occurred forwarding 8080 -> 8080: error forwarding port 8080 to pod 71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339, uid : failed to execute portforward in network namespace \"/var/run/netns/cni-6a5c29c4-f90d-5175-3aeb-cdbca524d613\": failed to connect to localhost:8080 inside namespace \"71a61495b14b7ae8b610e860acd7bd7a0bd4beb7feb4bd66064ce245150ff339\", IPv4: dial tcp4 127.0.0.1:8080: connect: connection refused IPv6 dial tcp6 [::1]:8080: connect: connection refused "
error: lost connection to pod

Setup instructions followed:
Cluster setup:

kind create cluster
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
sleep 120
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=master"

Environment setup:

conda create --name issue python=3.11
conda activate issue
pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

What did you expect to happen?

Prometheus metrics should be accessible at localhost:8080/metrics.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.1

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow-system -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/trainer/trainer-controller-manager:latest

Kubeflow Python SDK version:

$ pip show kubeflow
Name: kubeflow
Version: 0.1.0
Summary: Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.
Home-page: https://github.com/kubeflow/trainer
Author: 
Author-email: The Kubeflow Authors <[email protected]>
License: Apache License
Location: /home/izuku/miniconda3/envs/issue2/lib/python3.11/site-packages
Requires: kubernetes, pydantic
Required-by: 

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@izuku-sds

@tenzen-y Regarding the issue above: as @milinddethe15 mentioned, metrics are disabled by default. Should I change that?

@izuku-sds

Adding the following args to the deployment YAML gave access to the monitoring port via port-forwarding:

args:
- "--metrics-bind-address=:8080"
- "--metrics-secure=false"

Port forwarding:

$ kubectl port-forward -n kubeflow-system deployment/kubeflow-trainer-controller-manager 8080:8080
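
To confirm that the endpoint actually responds while the port-forward is running, something like the following Python check could be used. It is only a sketch: it assumes the metrics are served over plain HTTP at localhost:8080/metrics (as with `--metrics-secure=false` above) in the standard Prometheus text format, and the helper names `fetch_metrics` and `parse_metric_names` are made up for illustration.

```python
import re
import urllib.request

# Assumption: the port-forward above is active and metrics are served over plain HTTP.
METRICS_URL = "http://localhost:8080/metrics"

# A Prometheus metric name starts a sample line and matches this pattern.
_NAME_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def parse_metric_names(text: str) -> set[str]:
    """Return the distinct metric names found in Prometheus text-format output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _NAME_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Fetch the raw metrics page as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        body = fetch_metrics()
    except OSError as exc:
        # Connection refused lands here, e.g. when metrics are still disabled.
        print(f"metrics endpoint not reachable: {exc}")
    else:
        names = parse_metric_names(body)
        print(f"{len(names)} distinct metric names exposed")
```

If the script reports that the endpoint is not reachable, the controller is most likely still not listening on port 8080, which matches the "connection refused" error from the original report.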
