To activate development mode, include the dev profile in the comma-separated list of active Spring profiles, e.g.
-Dspring.profiles.active=docker,ollama,pgvector,dev
See the spring.config.activate.on-profile=dev stanza in application.yml.
To disable tracing, set an environment variable before starting the application:
export MANAGEMENT_TRACING_ENABLED=false
On Cloud Foundry, you would run:
cf set-env sanford MANAGEMENT_TRACING_ENABLED false
cf restage sanford
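Spring Boot's relaxed binding maps the MANAGEMENT_TRACING_ENABLED environment variable to the management.tracing.enabled property, so for a local run you could equally pass it as a system property; a minimal sketch, reusing the Gradle launch style shown later in this document:

```bash
# Disable tracing via a system property instead of an environment variable
gradle bootRun -Dmanagement.tracing.enabled=false
```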
Models must be stored in GPT-Generated Unified Format (GGUF)
- Chat (4-bit precision) - most downloads, recently updated
- Text Embedding (4-bit precision) - most downloads, recently updated
Prefix all models you pull with:
ollama pull hf.co/
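For instance, a pull might look like the following sketch; the account, repository, and quantization tag are placeholders, so substitute the model you actually selected:

```bash
# Hypothetical example: pull a 4-bit GGUF chat model from Hugging Face;
# Q4_K_M is a common 4-bit quantization tag
ollama pull hf.co/{replace_with_account}/{replace_with_repository}:Q4_K_M
```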
When serving models from Cloud Foundry with the GenAI tile
- Choose a compute type that has a minimum of 8 vCPUs, 64GB RAM, and 80GB disk
  - when targeting a CF environment provisioned on Google Cloud, choose c2d-highmem-8
- Choose among [ wizardlm2, qwen2.5:3b, mistral, gemma2 ] for the chat model
- Choose among [ all-minilm:33m, nomic-embed-text, aroxima/gte-qwen2-1.5b-instruct ] for the embedding model
  - the above-mentioned embedding models have dimensions of 384, 768, and 1536, respectively
- Choose Postgres for the vector store provider
E.g., if you're employing the deploy-on-tp4cf.sh script, edit the following variables to be
GENAI_CHAT_PLAN_NAME=qwen2.5:3b
GENAI_EMBEDDINGS_PLAN_NAME=aroxima/gte-qwen2-1.5b-instruct
and add the following to the sequence of cf set-env statements:
export SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS=1536
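If you prefer to make the change on an already deployed application, the same setting can be applied with the cf CLI (a sketch assuming the app name sanford used in the tracing example above):

```bash
# Match the pgvector dimensions to the chosen embedding model, then restage
cf set-env sanford SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS 1536
cf restage sanford
```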
When serving models from Ollama, you're encouraged to consult, then leverage, one of the provisioning scripts targeting a public cloud infrastructure provider:
- AWS
  - Before executing this script you'll need to export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. If you authenticate via a secure token service, then you'll also need to export AWS_SESSION_TOKEN.
- Azure
  - Before executing this script you'll need to export ARM_SUBSCRIPTION_ID, ARM_TENANT_ID, ARM_CLIENT_ID, and ARM_CLIENT_SECRET.
- Google Cloud
  - Before executing this script you'll need to execute gcloud auth application-default login.
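Putting the AWS variant together, the credential exports and provisioning call might look like this sketch (placeholder values follow the {replace_with_...} convention used elsewhere in this document):

```bash
# Authenticate, then create and start the Ollama VM on AWS
export AWS_ACCESS_KEY_ID={replace_with_your_access_key_id}
export AWS_SECRET_ACCESS_KEY={replace_with_your_secret_access_key}
# Only required when authenticating via a secure token service
export AWS_SESSION_TOKEN={replace_with_your_session_token}
./provision-ollama-vm-on-aws.sh create
```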
- Choose a compute type that has a minimum of 8 vCPUs, 64GB RAM, and 80GB disk
  - when targeting an Ollama VM installation hosted on
    - AWS, choose m6i.4xlarge
    - Azure, choose Standard_D16s_v4
    - Google Cloud, choose c2d-highmem-8
Here's what you need to know about each cloud provider's GPU configuration:
- AWS
  - GPU instances have specific instance types (p3, g4dn, p4d families)
  - Requires NVIDIA driver installation
  - Example configuration: GPU_INSTANCE_TYPE="g4dn.4xlarge" USE_GPU=true
- Azure
  - GPU VMs use specific VM sizes (NC, ND series)
  - Requires NVIDIA driver installation
  - Example configuration: GPU_VM_SIZE="Standard_NC12s_v3" USE_GPU=true
- Google Cloud
  - Common GPU types: nvidia-tesla-t4, nvidia-tesla-p100, nvidia-tesla-v100
  - GPU-enabled zones may be limited
  - Requires a special image family for GPU support
  - Example configuration: GPU_TYPE="nvidia-tesla-t4" GPU_COUNT=1
Important considerations:
- GPU instances are significantly more expensive than regular instances
- Not all regions/zones support GPU instances
- You may need to request quota increases for GPU instances
- Some GPU types require specific machine types/sizes
- Driver installation may take several minutes during instance startup
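Tying the example configurations above to the provisioning flow, a GPU-backed run on AWS might look like the following sketch; it assumes the provisioning script honors these variables when exported, so consult the script itself before relying on them:

```bash
# Provision a GPU-backed Ollama VM on AWS
# (variables taken from the AWS example configuration above)
export GPU_INSTANCE_TYPE="g4dn.4xlarge"
export USE_GPU=true
./provision-ollama-vm-on-aws.sh create
```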
Here's how to get going when running locally while targeting models hosted on a VM in a public cloud:
# Checkout source
gh repo clone cf-toolsuite/sanford
cd sanford
# Run provisioning script to create and start a VM with Ollama hosted in [ aws|azure|googlecloud ]
./provision-ollama-vm-on-{replace_with_available_public_cloud_variant}.sh create
# Set environment variables (override defaults)
export CHAT_MODEL=wizardlm2
export EMBEDDING_MODEL=all-minilm:33m
export SPRING_AI_VECTORSTORE_PGVECTOR_DIMENSIONS=384
export OLLAMA_BASE_URL=http://{replace_with_ip_address_of_ollama_instance}:11434
gradle clean build bootRun -Pvector-db-provider=pgvector -Pmodel-api-provider=ollama -Dspring.profiles.active=docker,ollama,pgvector,dev
time http --verify=no POST :8080/api/fetch urls:='["https://www.govtrack.us/api/v2/role?current=true&role_type=senator"]'
time http GET 'http://localhost:8080/api/chat?q="Who are the US senators from Washington?"&f[state]="WA"&f[gender]="female"'
Activate the arize-phoenix Spring profile in addition to the docker Spring profile. You may do that by adding it to the comma-separated list of active profiles using either
- a command-line runtime argument, -Dspring.profiles.active=
- an environment variable, export SPRING_PROFILES_ACTIVE=
After launching the application and making a request, visit http://localhost:6006.
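For example, mirroring the local Gradle launch shown earlier (a sketch; the profile list assumes the same Ollama and pgvector setup):

```bash
# Launch with Arize Phoenix tracing active alongside the other profiles
gradle clean build bootRun -Pvector-db-provider=pgvector -Pmodel-api-provider=ollama -Dspring.profiles.active=docker,ollama,pgvector,arize-phoenix
```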
The runtime configuration may be adapted to work without the docker Spring profile activated. Consult Arize Phoenix's self-hosting deployment documentation and the ARIZE_PHOENIX_BASE_URL environment variable in application.yml.
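When pointing at a self-hosted Phoenix instance, the override might be as simple as the following sketch (the host is a placeholder; 6006 is the default Phoenix port referenced above):

```bash
# Point the application at an externally hosted Arize Phoenix collector
export ARIZE_PHOENIX_BASE_URL=http://{replace_with_phoenix_host}:6006
```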