4 changes: 4 additions & 0 deletions .github/workflows/build-and-push.yaml
@@ -170,6 +170,10 @@ jobs:

rm -Rf $(ls . | grep -v config)
rm -Rf .gitignore .dockerignore .github .git .yamllint.yaml

cat ./config/base/params.env
cat ./config/overlays/odh/params.env
cat ./config/overlays/rhoai/params.env
# push to ci-manifest repo
- uses: cpina/github-action-push-to-another-repository@main
if: env.BUILD_CONTEXT == 'ci'
3 changes: 3 additions & 0 deletions Dockerfile.lmes-job
@@ -3,6 +3,9 @@ FROM registry.access.redhat.com/ubi9/python-311@sha256:fccda5088dd13d2a3f2659e4c
USER root
RUN sed -i.bak 's/include-system-site-packages = false/include-system-site-packages = true/' /opt/app-root/pyvenv.cfg

# Required dependency for oci.py in lmes-job. Added here for `needs-lmes-build` builds; already incorporated upstream: https://github.com/opendatahub-io/lm-evaluation-harness/blob/3c4dec006a4a096b546d60c3364c78acbe33cd48/Dockerfile.lmes-job#L16
RUN dnf install -y skopeo && dnf clean all

USER default
WORKDIR /opt/app-root/src
RUN mkdir /opt/app-root/src/hf_home && chmod g+rwx /opt/app-root/src/hf_home
11 changes: 11 additions & 0 deletions Makefile
@@ -263,3 +263,14 @@ catalog-build: opm ## Build a catalog image.
.PHONY: catalog-push
catalog-push: ## Push a catalog image.
$(MAKE) docker-push IMG=$(CATALOG_IMG)

# Generate the full set of manifests to deploy the TrustyAI operator, with a customizable deployment namespace and operator image
OPERATOR_IMAGE ?= quay.io/trustyai/trustyai-service-operator:latest
.PHONY: manifest-gen
manifest-gen: kustomize
@echo "Usage: make manifest-gen NAMESPACE=<namespace> OPERATOR_IMAGE=<image>"
@echo "Example: make manifest-gen NAMESPACE=my-namespace OPERATOR_IMAGE=quay.io/myorg/trustyai-service-operator:latest"
mkdir -p release
@if [ -z "$(NAMESPACE)" ]; then echo "Error: NAMESPACE argument is required"; exit 1; fi
$(KUSTOMIZE) build config/base | sed "s|namespace: system|namespace: $(NAMESPACE)|g" | sed "s|quay.io/trustyai/trustyai-service-operator:latest|$(OPERATOR_IMAGE)|g" > release/trustyai_bundle.yaml
@echo "Release manifest generated at release/trustyai_bundle.yaml with namespace '$(NAMESPACE)' and operator image '$(OPERATOR_IMAGE)'"
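
The `manifest-gen` target above boils down to two text substitutions on the `kustomize build` output. A minimal Go sketch of that logic follows, assuming a hypothetical `renderManifest` helper that is not part of this repository. One difference worth noting: `strings.ReplaceAll` matches literally, whereas the `sed` patterns treat `.` as a regex wildcard.

```go
package main

import (
	"fmt"
	"strings"
)

// renderManifest mimics the two sed substitutions in the manifest-gen target:
// it rewrites the placeholder namespace and the default operator image.
// The function name and signature are hypothetical, for illustration only.
func renderManifest(manifest, namespace, image string) (string, error) {
	// Mirror the Makefile check that NAMESPACE must be set.
	if namespace == "" {
		return "", fmt.Errorf("NAMESPACE argument is required")
	}
	const defaultImage = "quay.io/trustyai/trustyai-service-operator:latest"
	out := strings.ReplaceAll(manifest, "namespace: system", "namespace: "+namespace)
	out = strings.ReplaceAll(out, defaultImage, image)
	return out, nil
}

func main() {
	in := "metadata:\n  namespace: system\nimage: quay.io/trustyai/trustyai-service-operator:latest\n"
	out, err := renderManifest(in, "my-namespace", "quay.io/myorg/trustyai-service-operator:latest")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

As in the Makefile, leaving the image at its default value makes the second substitution a no-op.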
172 changes: 21 additions & 151 deletions README.md
@@ -7,174 +7,44 @@

## Overview

The TrustyAI Kubernetes Operator aims at simplifying the deployment and management of the [TrustyAI service](https://github.com/trustyai-explainability/trustyai-explainability/tree/main/explainability-service) on Kubernetes and OpenShift clusters by watching for custom resources of kind `TrustyAIService` in the `trustyai.opendatahub.io` API group and manages deployments, services, and optionally, routes and `ServiceMonitors` corresponding to these resources.

The operator ensures the service is properly configured, is discoverable by Prometheus for metrics scraping (on both Kubernetes and OpenShift), and is accessible via a Route on OpenShift.
The TrustyAI Kubernetes Operator aims to simplify the deployment and management of TrustyAI components on Kubernetes, such as:
- [TrustyAI Service](https://github.com/trustyai-explainability/trustyai-explainability): A service that deploys alongside KServe models and collects inference data to enable model explainability, fairness monitoring, and drift tracking.
- [FMS-Guardrails](https://github.com/foundation-model-stack/fms-guardrails-orchestrator): A modular framework for guardrailing LLMs.
- [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness/tree/main): A job-based architecture for deploying and managing LLM evaluations, based on EleutherAI's lm-evaluation-harness library.

## Prerequisites

- Kubernetes cluster v1.19+ or OpenShift cluster v4.6+
- `kubectl` v1.19+ or `oc` client v4.6+

## Installation using pre-built Operator image
- `kustomize` v5+
## Installation

This operator is available as an [image on Quay.io](https://quay.io/repository/trustyai/trustyai-service-operator?tab=history).
To deploy it on your cluster:

1. **Install the Custom Resource Definition (CRD):**

Apply the CRD to your cluster (replace the URL with the relevant one, if using another repository):

```bash
kubectl apply -f https://raw.githubusercontent.com/trustyai-explainability/trustyai-service-operator/main/config/crd/bases/trustyai.opendatahub.io_trustyaiservices.yaml
```

2. **Deploy the Operator:**

Apply the following Kubernetes manifest to deploy the operator:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: trustyai-operator
namespace: trustyai-operator-system
spec:
replicas: 1
selector:
matchLabels:
control-plane: trustyai-operator
template:
metadata:
labels:
control-plane: trustyai-operator
spec:
containers:
- name: trustyai-operator
image: quay.io/trustyai/trustyai-service-operator:latest
command:
- /manager
resources:
limits:
cpu: 100m
memory: 30Mi
requests:
cpu: 100m
memory: 20Mi
```

or run

```shell
kubectl apply -f https://raw.githubusercontent.com/trustyai-explainability/trustyai-service-operator/main/artifacts/examples/deploy-operator.yaml
```

## Usage

Once the operator is installed, you can create `TrustyAIService` resources, and the operator will create corresponding TrustyAI deployments, services, and (on OpenShift) routes.

Here's an example `TrustyAIService` manifest:

```yaml
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: TrustyAIService
metadata:
name: trustyai-service-example
spec:
storage:
format: "PVC"
folder: "/inputs"
size: "1Gi"
data:
filename: "data.csv"
format: "CSV"
metrics:
schedule: "5s"
batchSize: 5000 # Optional, defaults to 5000
```

You can apply this manifest with

```shell
kubectl apply -f <file-name.yaml> -n $NAMESPACE
OPERATOR_NAMESPACE=opendatahub
make manifest-gen NAMESPACE=$OPERATOR_NAMESPACE KUSTOMIZE=kustomize
oc apply -f release/trustyai_bundle.yaml -n $OPERATOR_NAMESPACE
```
to create a service, where `$NAMESPACE` is the namespace where you want to deploy it.
You can also build your own image, and use that as your TrustyAI operator:


Additionally, in that namespace:

* a `ServiceMonitor` will be created to allow Prometheus to scrape metrics from the service.
* (if on OpenShift) a `Route` will be created to allow external access to the service.

### Custom Image Configuration using ConfigMap
You can specify a custom TrustyAI-service image via adding parameters to the TrustyAI-Operator KFDef, for example:

```yaml
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
name: trustyai-service-operator
namespace: opendatahub
spec:
applications:
- kustomizeConfig:
repoRef:
name: manifests
path: config
parameters:
- name: trustyaiServiceImage
value: NEW_IMAGE_NAME
name: trustyai-service-operator
repos:
- name: manifests
uri: https://github.com/trustyai-explainability/trustyai-service-operator/tarball/main
version: v1.0.0
```shell
OPERATOR_NAMESPACE=opendatahub
OPERATOR_IMAGE=quay.io/yourorg/your-image-name:latest
podman build -t $OPERATOR_IMAGE --platform linux/amd64 -f Dockerfile .
podman push $OPERATOR_IMAGE
make manifest-gen NAMESPACE=$OPERATOR_NAMESPACE OPERATOR_IMAGE=$OPERATOR_IMAGE KUSTOMIZE=kustomize
oc apply -f release/trustyai_bundle.yaml -n $OPERATOR_NAMESPACE
```
If these parameters are unspecified, the [default image and tag](config/base/params.env) will be used.


If you'd like to change the service image/tag after deploying the operator, simply change the parameters in the KFDef. Any
TrustyAI service deployed subsequently will use the new image and tag.

### `TrustyAIService` Status Updates

The `TrustyAIService` custom resource tracks the availability of `InferenceServices` and `PersistentVolumeClaims (PVCs)`
through its `status` field. Below are the status types and reasons that are available:

#### `InferenceService` Status

| Status Type | Status Reason | Description |
|-------------------------------|-----------------------------------|-----------------------------------|
| `InferenceServicesPresent` | `InferenceServicesNotFound` | InferenceServices were not found. |
| `InferenceServicesPresent` | `InferenceServicesFound` | InferenceServices were found. |

#### `PersistentVolumeClaim` (PVCs) Status

| Status Type | Status Reason | Description |
|------------------|-----------------|------------------------------------|
| `PVCAvailable` | `PVCNotFound` | `PersistentVolumeClaim` not found. |
| `PVCAvailable` | `PVCFound` | `PersistentVolumeClaim` found. |

#### Database Status

| Status Type | Status Reason | Description |
|---------------|-------------------------|---------------------------------------------------|
| `DBAvailable` | `DBCredentialsNotFound` | Database credentials secret not found |
| `DBAvailable` | `DBCredentialsError` | Database credentials malformed (e.g. missing key) |
| `DBAvailable` | `DBConnectionError` | Service error connecting to the database |
| `DBAvailable` | `DBAvailable` | Successfully connected to the database |


#### Status Behavior

- If a PVC is not available, the `Ready` status of `TrustyAIService` will be set to `False`.
- If on database mode, any `DBAvailable` reason other than `DBAvailable` will set the `TrustyAIService` to `Not Ready`
- However, if `InferenceServices` are not found, the `Ready` status of `TrustyAIService` will not be affected, _i.e._, it is `Ready` by all other conditions, it will remain so.
## Usage
For usage information, please see the [OpenDataHub documentation of TrustyAI](https://opendatahub.io/docs/monitoring-data-science-models/#configuring-trustyai_monitor).

## Contributing

Please see the [CONTRIBUTING.md](./CONTRIBUTING.md) file for more details on how to contribute to this project.

## License

This project is licensed under the Apache License Version 2.0 - see the [LICENSE](./LICENSE) file for details.
38 changes: 38 additions & 0 deletions api/common/condition.go
@@ -0,0 +1,38 @@
package common

import (
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:object:generate=true
type Condition struct {
Type string `json:"type" description:"type of condition ie. Available|Progressing|Degraded."`

Status corev1.ConditionStatus `json:"status" description:"status of the condition, one of True, False, Unknown"`

// +optional
Reason string `json:"reason,omitempty" description:"one-word CamelCase reason for the condition's last transition"`

// +optional
Message string `json:"message,omitempty" description:"human-readable message indicating details about last transition"`

// +optional
LastTransitionTime metav1.Time `json:"lastTransitionTime" description:"last time the condition transit from one status to another"`
}

// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
func (in *Condition) DeepCopyInto(out *Condition) {
*out = *in
in.LastTransitionTime.DeepCopyInto(&out.LastTransitionTime)
}

// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new Condition.
func (in *Condition) DeepCopy() *Condition {
if in == nil {
return nil
}
out := new(Condition)
in.DeepCopyInto(out)
return out
}
25 changes: 5 additions & 20 deletions api/gorch/v1alpha1/guardrailsorchestrator_types.go
@@ -17,8 +17,8 @@ limitations under the License.
package v1alpha1

import (
"github.com/trustyai-explainability/trustyai-service-operator/api/common"
corev1 "k8s.io/api/core/v1"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

@@ -43,7 +43,7 @@ type GuardrailsOrchestratorSpec struct {
// Important: Run "make" to regenerate code after modifying this file
// Number of replicas
Replicas int32 `json:"replicas"`
// Name of configmap containing generator,detector,and chunker arguments
// Name of configmap containing generator, detector, and chunker arguments
// +optional
OrchestratorConfig *string `json:"orchestratorConfig,omitempty"`
// Settings governing the automatic configuration of the orchestrator. Replaces `OrchestratorConfig`.
@@ -70,6 +70,8 @@ type GuardrailsOrchestratorSpec struct {
// Define TLS secrets to be mounted to the orchestrator. Secrets will be mounted at /etc/tls/$SECRET_NAME
// +optional
TLSSecrets *[]string `json:"tlsSecrets,omitempty"`
// Define environment variables. These will be added to the orchestrator, gateway, and built-in detector pods.
EnvVars *[]corev1.EnvVar `json:"env,omitempty"`
}
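
The new `EnvVars` field is an optional pointer to a slice, so any consumer has to handle the `nil` case before adding the variables to a container. The sketch below shows one way that could look. It is not taken from this PR: `EnvVar` is a stdlib-only stand-in for `corev1.EnvVar`, `mergeEnv` is a hypothetical helper, and the rule that user-supplied variables override same-named defaults is an assumption.

```go
package main

import "fmt"

// EnvVar is a minimal stand-in for corev1.EnvVar (Name and Value only).
type EnvVar struct {
	Name  string
	Value string
}

// mergeEnv (hypothetical helper) returns base plus the user-defined entries.
// On a name collision the user-defined entry replaces the base entry; this
// override rule is an assumption, not confirmed by the source.
func mergeEnv(base []EnvVar, extra *[]EnvVar) []EnvVar {
	out := make([]EnvVar, len(base))
	copy(out, base)
	if extra == nil {
		// Field omitted in the spec: nothing to add.
		return out
	}
	for _, e := range *extra {
		replaced := false
		for i := range out {
			if out[i].Name == e.Name {
				out[i] = e
				replaced = true
				break
			}
		}
		if !replaced {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	base := []EnvVar{{Name: "LOG_LEVEL", Value: "info"}}
	user := []EnvVar{{Name: "LOG_LEVEL", Value: "debug"}, {Name: "HTTP_PROXY", Value: "http://proxy:3128"}}
	fmt.Println(mergeEnv(base, &user))
	fmt.Println(mergeEnv(base, nil))
}
```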

// OTelExporter defines the environment variables for configuring the OTLP exporter.
@@ -92,23 +94,6 @@ type OTelExporter struct {
EnableMetrics bool `json:"enableMetrics,omitempty"`
}

type ConditionType string

type Condition struct {
Type ConditionType `json:"type" description:"type of condition ie. Available|Progressing|Degraded."`

Status corev1.ConditionStatus `json:"status" description:"status of the condition, one of True, False, Unknown"`

// +optional
Reason string `json:"reason,omitempty" description:"one-word CamelCase reason for the condition's last transition"`

// +optional
Message string `json:"message,omitempty" description:"human-readable message indicating details about last transition"`

// +optional
LastTransitionTime metav1.Time `json:"lastTransitionTime" description:"last time the condition transit from one status to another"`
}

type DetectedService struct {
Name string `json:"name,omitempty"`
Type string `json:"type,omitempty"` // e.g. "generator" or "detector"
@@ -133,7 +118,7 @@ type GuardrailsOrchestratorStatus struct {
Phase string `json:"phase,omitempty"`
// Conditions describes the state of the GuardrailsOrchestrator resource.
// +optional
Conditions []Condition `json:"conditions,omitempty"`
Conditions []common.Condition `json:"conditions,omitempty"`
// AutoConfigState describes information about the generated autoconfiguration
// +optional
AutoConfigState *AutoConfigState `json:"autoConfigState,omitempty"`
31 changes: 14 additions & 17 deletions api/gorch/v1alpha1/zz_generated.deepcopy.go
