To activate development mode, include the dev profile in the comma-separated list of active Spring profiles, e.g.
-Dspring.profiles.active=docker,ollama,pgvector,dev
See the spring.config.activate.on-profile=dev stanza in application.yml.
To disable tracing, set an environment variable before starting the application:
export MANAGEMENT_TRACING_ENABLED=false
On Cloud Foundry, you would run:
cf set-env sanford MANAGEMENT_TRACING_ENABLED false
cf restage sanford
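Spring Boot's relaxed binding maps the MANAGEMENT_TRACING_ENABLED environment variable to the management.tracing.enabled property, so for a local run you could equally pass it as a system property; a minimal sketch, reusing the Gradle launch style shown later in this document:

```bash
# Disable tracing via a system property instead of an environment variable
gradle bootRun -Dmanagement.tracing.enabled=false
```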
Models must be stored in GPT-Generated Unified Format (GGUF)
- Chat (4-bit precision) - most downloads, recently updated
- Text Embedding (4-bit precision) - most downloads, recently updated
Prefix all models you pull with:
ollama pull hf.co/
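For instance, a pull might look like the following sketch; the account, repository, and quantization tag are placeholders, so substitute the model you actually selected:

```bash
# Hypothetical example: pull a 4-bit GGUF chat model from Hugging Face;
# Q4_K_M is a common 4-bit quantization tag
ollama pull hf.co/{replace_with_account}/{replace_with_repository}:Q4_K_M
```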
When serving models from Cloud Foundry with the GenAI tile
- Choose a compute type that has a minimum of 8 vCPUs, 64GB RAM, and 80GB disk
  - when targeting a CF environment provisioned on Google Cloud, choose c2d-highmem-8
- Choose among [ wizardlm2, qwen2.5:3b, mistral, gemma2 ] for the chat model
- Choose among [ all-minilm:33m, nomic-embed-text, aroxima/gte-qwen2-1.5b-instruct ] for the embedding model
  - the above-mentioned embedding models have dimensions of 384, 768, and 1536, respectively
- Choose Postgres for the vector store provider
E.g., if you're employing the deploy-on-tp4cf.sh script, edit the following variables to be
GENAI_CHAT_PLAN_NAME=qwen2.5:3b
GENAI_EMBEDDINGS_PLAN_NAME=aroxima/gte-qwen2-1.5b-instruct
and add the following to the sequence of cf set-env statements:
export SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS=1536
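If you prefer to make the change on an already deployed application, the same setting can be applied with the cf CLI (a sketch assuming the app name sanford used in the tracing example above):

```bash
# Match the pgvector dimensions to the chosen embedding model, then restage
cf set-env sanford SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS 1536
cf restage sanford
```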
When serving models from Ollama, you're encouraged to consult, then leverage, one of the provisioning scripts targeting a public cloud infrastructure provider:
- AWS
  - Before executing this script you'll need to export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. If you authenticate via a secure token service, then you'll also need to export AWS_SESSION_TOKEN.
- Azure
  - Before executing this script you'll need to export ARM_SUBSCRIPTION_ID, ARM_TENANT_ID, ARM_CLIENT_ID, and ARM_CLIENT_SECRET.
- Google Cloud
  - Before executing this script you'll need to execute gcloud auth application-default login.
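Putting the AWS variant together, the credential exports and provisioning call might look like this sketch (placeholder values follow the {replace_with_...} convention used elsewhere in this document):

```bash
# Authenticate, then create and start the Ollama VM on AWS
export AWS_ACCESS_KEY_ID={replace_with_your_access_key_id}
export AWS_SECRET_ACCESS_KEY={replace_with_your_secret_access_key}
# Only required when authenticating via a secure token service
export AWS_SESSION_TOKEN={replace_with_your_session_token}
./provision-ollama-vm-on-aws.sh create
```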
- Choose a compute type that has a minimum of 8 vCPUs, 64GB RAM, and 80GB disk
  - when targeting an Ollama VM installation hosted on
    - AWS, choose m6i.4xlarge
    - Azure, choose Standard_D16s_v4
    - Google Cloud, choose c2d-highmem-8
Here's what you need to know about each cloud provider's GPU configuration:
- AWS
  - GPU instances have specific instance types (p3, g4dn, p4d families)
  - Requires NVIDIA driver installation
  - Example configuration: GPU_INSTANCE_TYPE="g4dn.4xlarge" USE_GPU=true
- Azure
  - GPU VMs use specific VM sizes (NC, ND series)
  - Requires NVIDIA driver installation
  - Example configuration: GPU_VM_SIZE="Standard_NC12s_v3" USE_GPU=true
- Google Cloud
  - Common GPU types: nvidia-tesla-t4, nvidia-tesla-p100, nvidia-tesla-v100
  - GPU-enabled zones may be limited
  - Requires a special image family for GPU support
  - Example configuration: GPU_TYPE="nvidia-tesla-t4" GPU_COUNT=1
Important considerations:
- GPU instances are significantly more expensive than regular instances
- Not all regions/zones support GPU instances
- You may need to request quota increases for GPU instances
- Some GPU types require specific machine types/sizes
- Driver installation may take several minutes during instance startup
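Tying the example configurations above to the provisioning flow, a GPU-backed run on AWS might look like the following sketch; it assumes the provisioning script honors these variables when exported, so consult the script itself before relying on them:

```bash
# Provision a GPU-backed Ollama VM on AWS
# (variables taken from the AWS example configuration above)
export GPU_INSTANCE_TYPE="g4dn.4xlarge"
export USE_GPU=true
./provision-ollama-vm-on-aws.sh create
```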
Here's how to get going when running locally while targeting models hosted on a VM in a public cloud:
# Checkout source
gh repo clone cf-toolsuite/sanford
cd sanford
# Run provisioning script to create and start a VM with Ollama hosted in [ aws|azure|googlecloud ]
./provision-ollama-vm-on-{replace_with_available_public_cloud_variant}.sh create
# Set environment variables (override defaults)
export CHAT_MODEL=wizardlm2
export EMBEDDING_MODEL=all-minilm:33m
export SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS=384
export OLLAMA_BASE_URL=http://{replace_with_ip_address_of_ollama_instance}:11434
gradle clean build bootRun -Pvector-db-provider=pgvector -Pmodel-api-provider=ollama -Dspring.profiles.active=docker,ollama,pgvector,dev
time http --verify=no POST :8080/api/fetch urls:='["https://www.govtrack.us/api/v2/role?current=true&role_type=senator"]'
time http GET 'http://localhost:8080/api/chat?q="Who are the US senators from Washington?"&f[state]="WA"&f[gender]="female"'
Activate the arize-phoenix Spring profile in addition to the docker Spring profile. You may do that by adding it to the comma-separated list of active profiles using either
- a command-line runtime argument, -Dspring.profiles.active=
- an environment variable, export SPRING_PROFILES_ACTIVE=
After launching the application and making a request, visit http://localhost:6006.
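For example, mirroring the local Gradle launch shown earlier (a sketch; the profile list assumes the same Ollama and pgvector setup):

```bash
# Launch with Arize Phoenix tracing active alongside the other profiles
gradle clean build bootRun -Pvector-db-provider=pgvector -Pmodel-api-provider=ollama -Dspring.profiles.active=docker,ollama,pgvector,arize-phoenix
```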
The runtime configuration may be adapted to work without the docker Spring profile activated. Consult Arize Phoenix's self-hosting deployment documentation and the ARIZE_PHOENIX_BASE_URL environment variable in application.yml.
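When pointing at a self-hosted Phoenix instance, the override might be as simple as the following sketch (the host is a placeholder; 6006 is the default Phoenix port referenced above):

```bash
# Point the application at an externally hosted Arize Phoenix collector
export ARIZE_PHOENIX_BASE_URL=http://{replace_with_phoenix_host}:6006
```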