Vertex AI Endpoint Stress Tester (#1336)
* Updated test files names for issue #1169

* Vertex AI Endpoint Stress Tester utility: First push

* Updated the vegata script as per Trigger build errors

* Further fixes for build failures

* Further fixes for build failures

* Updated README.md

---------

Co-authored-by: Andrew Gold <[email protected]>
suddhasatwabhaumik and agold-rh authored Aug 16, 2024
1 parent 91bb52d commit 8c000a0
Showing 11 changed files with 673 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -564,6 +564,10 @@ Platform usage.
* [STS Job Manager](tools/sts-job-manager/) - A petabyte-scale bucket
migration tool utilizing
[Storage Transfer Service](https://cloud.google.com/storage-transfer-service)
* [Vertex AI Endpoint Tester](tools/vertex-ai-endpoint-load-tester) - This
utility helps to methodically test a variety of Vertex AI Endpoints by their
sizes so that one can decide the right size to deploy an ML model on Vertex
AI, given a sample request JSON and some idea of the expected queries per second.
* [VM Migrator](tools/vm-migrator) - This utility automates migrating Virtual
Machine instances within GCP. You can migrate VM's from one zone to another
zone/region within the same project or different projects while retaining
79 changes: 79 additions & 0 deletions tools/vertex-ai-endpoint-load-tester/README.md
@@ -0,0 +1,79 @@
```
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

# Vertex AI Endpoint Stress Tester

go/vertex-endpoint-stress-tester

## Introduction

Vertex AI Endpoints are a managed solution for deploying ML models at scale. Architecturally, they use GKE or similar infrastructure components in the background to provide deployment and inference capabilities for any ML model, whether AutoML or custom.

In some of our recent engagements, questions have been raised about the scalability of Vertex AI Endpoints. A sample notebook, available on GitHub under the Google Cloud Platform account, explains one of the many ways to check how much load a particular instance can handle. However, it is not an automated solution that anyone from Google Cloud Consulting (GCC) can use with ease. It also involves tedious manual work: creating and deleting endpoints, and deploying ML models on them, to test the load that a specific type of VM can handle. Given that the Vertex AI endpoint service continues to grow and supports a variety of instance types, this procedure needs improvement, so that anyone from GCC can easily deploy a given ML model on a series of endpoints of various sizes and check which one best suits the workload, given an estimate of how much traffic the model is expected to receive once it goes to production.

This is where we propose our automated tool (proposed to be open sourced in the PSO GitHub and KitHub), whose objective is to automatically stress test one particular model over various types of endpoint configurations, with and without autoscaling, so that we have a data-driven approach to deciding the right size for the endpoint.
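
As a rough illustration of the test matrix this implies, the sketch below enumerates every combination of machine type and request rate. The function name `build_test_matrix` and the dictionary keys are hypothetical; the sample values mirror the shipped `config/config.ini`, and the tool itself may organize its runs differently.

```python
from itertools import product


def build_test_matrix(machine_types, rates, min_nodes, max_nodes):
    """Return one test case per (machine type, QPS rate) combination."""
    return [
        {
            "machine_type": machine_type,
            "rate_qps": rate,
            "min_nodes": min_nodes,
            "max_nodes": max_nodes,
        }
        for machine_type, rate in product(machine_types, rates)
    ]


# Values taken from the sample config; the helper itself is an assumption.
matrix = build_test_matrix(["n1-standard-4", "n1-standard-8"], [25, 50], 1, 2)
for case in matrix:
    print(case)
```

Each added machine type multiplies the number of deployments, which is why the instructions below suggest starting with only one or two VM types.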

## Assumptions

1. The ML model is already built. This automation tool will not train it, but will simply reference it from BQML or the Vertex AI Model Registry.
2. The deployed ML model accepts a valid JSON request as input and provides online predictions as output, preferably as JSON.
3. The user of this utility has at least one example JSON request file, placed in the [requests](requests/) folder. Please see the existing [example](requests/request_movie.json) for clarity.
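
Before running a load test, it can help to confirm that the request file is well formed. The following is a minimal sketch, assuming the request body follows the usual Vertex AI online prediction shape with a top-level `instances` list, as the bundled example does. The function name `validate_request_body` is hypothetical and is not part of this utility.

```python
import json


def validate_request_body(text):
    """Parse a request body and check that it has a non-empty "instances" list."""
    body = json.loads(text)
    instances = body.get("instances") if isinstance(body, dict) else None
    if not isinstance(instances, list) or not instances:
        raise ValueError('request body must contain a non-empty "instances" list')
    return body


# A shortened version of the bundled example request.
sample = '{"instances": [{"Id": 3837, "genre": "Comedy", "runtime": 104}]}'
body = validate_request_body(sample)
print(len(body["instances"]), "instance(s) found")
```

To check a real file, read its contents and pass the text to the function.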

## How to Install & Run?

Out of the box, the utility can be run from the command line, so the best way to try it for the first time is to:

1. Edit the [config](config/config.ini) file and select only 1 or 2 VM types.
2. Place the request JSON file into the [requests](requests/) folder. Please see the existing [example](requests/request_movie.json) for reference.
3. Run the utility as follows:


```
cd vertex-ai-endpoint-load-tester/
gcloud auth login
gcloud config set project PROJECT_ID
python main.py
```

## Logging

When run from the command line, all logs are printed to the console (STDOUT) for the user to validate. They are NOT stored anywhere else for historical reference.
Hence we recommend installing this solution as a container on Cloud Run and running it as a Cloud Run service or job (as applicable), so that all logs can then be found in Cloud Logging.

## Reporting/Analytics

TODO: This is an open feature, and will be added shortly.
The idea here is to utilize a Looker Studio dashboard to visualize the results of the load testing, so that it is easily consumable by anyone!

## Troubleshooting

1. Check that the user or service account running the job (on Cloud Run, for example) has the requisite IAM permissions.
2. Ensure the [config](config/config.ini) file has no typos or extraneous information.
3. Check the logs for any specific errors to debug further.

## Known Errors

TODO

## Roadmap

In the future, we aim to extend this utility to LLMs and other types of ML models.
Further, we can also extend the same approach to load test other GCP services, such as GKE, that are frequently used to deploy ML solutions.

## Authors:

Ajit Sonawane - AI Engineer, Google Cloud
Suddhasatwa Bhaumik - AI Engineer, Google Cloud
64 changes: 64 additions & 0 deletions tools/vertex-ai-endpoint-load-tester/config/config.ini
@@ -0,0 +1,64 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Input configurations
[config]

# logging level
log_level = INFO

# deployed model ID
MODEL_ID = 888526341522063360

# the QPS rates to try
RATE = [25, 50]

# duration for which tests will be run
DURATION = 10

# BigQuery table to store results
OUTPUT_BQ_TBL_ID = load_test_dataset.test9

# project ID
PROJECT = rare-signer-355918

# region
LOCATION = us-central1

# amount of sleep time (in seconds)
# before the endpoint is tested
# after the model is deployed
TIMEOUT = 300

# autoscaling details.
MIN_NODES = 1
MAX_NODES = 2

# Types of machines to
# be used during testing;
# needs to be a comma-separated list of VM types
MACHINE_TYPES_LST = n1-standard-4,n1-standard-8

# name of the request body file in the requests folder, used for the POST call to the stress-testing API
# Please do not enclose file names in quotes
REQUEST_FILE = request_movie.json

# , "n1-standard-32", "n1-standard-64"]

# "n1-standard-4", "n1-standard-8", "n1-standard-16", "n1-standard-32",
# "n1-highmem-2", "n1-highmem-4", "n1-highmem-8", "n1-highmem-16", "n1-highmem-32",
# "n1-highcpu-2", "n1-highcpu-4", "n1-highcpu-8", "n1-highcpu-16", "n1-highcpu-32",
# "c3-standard-4", "c3-standard-8", "c3-standard-22", "c3-standard-44", "c3-standard-88", "c3-standard-176"]

# End.
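
`main.py` reads these options through `utils.read_config`, which is not shown in this view. The following is a minimal sketch of how the values could be parsed with Python's standard `configparser`, assuming that is roughly what the helper does. Note that `configparser` lowercases option names by default, which would explain why `main.py` looks up `model_id` rather than `MODEL_ID`. The function name `parse_settings` and the returned dictionary keys are hypothetical.

```python
import configparser
import json

# A trimmed copy of the sample configuration above.
SAMPLE = """
[config]
log_level = INFO
RATE = [25, 50]
DURATION = 10
TIMEOUT = 300
MIN_NODES = 1
MAX_NODES = 2
MACHINE_TYPES_LST = n1-standard-4,n1-standard-8
"""


def parse_settings(text):
    """Parse the [config] section into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["config"]
    return {
        # RATE is written as a JSON list, so json.loads can decode it.
        "rate": json.loads(section["rate"]),
        "duration": section.getint("duration"),
        "timeout": section.getint("timeout"),
        "min_nodes": section.getint("min_nodes"),
        "max_nodes": section.getint("max_nodes"),
        # Split the comma-separated machine types and trim stray spaces.
        "machine_types": [m.strip() for m in section["machine_types_lst"].split(",")],
    }


settings = parse_settings(SAMPLE)
print(settings)
```

Converting numeric options once, at parse time, avoids passing strings to calls such as `time.sleep` later on.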
Binary file not shown.
211 changes: 211 additions & 0 deletions tools/vertex-ai-endpoint-load-tester/main.py
@@ -0,0 +1,211 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#
# Script deploys a Vertex AI endpoint
# and captures endpoint performance to BQ
#
# Authors: ajitsonawane@,suddhasatwa@
# Team: Google Cloud Consulting
# Date: 25.01.2024

# Imports
import sys
import logging
import traceback
import uuid
import time
import json
from google.cloud import aiplatform

from utils import utils
# from utils import config_parser as cfp
# from utils.utils import register_latency
# from utils.utils import log_latencies_to_bq
# from utils.utils import write_results_to_bq

# function to process requests to endpoint.
def process(machine_type: str, latencies: list, log_level: str):
    """
    Deploys the model on an endpoint of the given machine type and measures latencies.

    Takes the latencies list as input.
    Calls the Vegeta utility to update latencies for each machine type.
    Passes it to another utility to generate full results.
    Returns the results.

    Inputs:
        machine_type: each type of machine to be tested.
        latencies: list (usually empty) to get results from Vegeta
        log_level: level of logging.
    Outputs:
        results: Combined results for each machine type.
    """

    # set logging setup
    logging.basicConfig(level=log_level, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")

    # start logging.
    logging.info("Reading configuration.")

    # read config.
    config_data = utils.read_config("config/config.ini")
    MODEL_ID = config_data["config"]["model_id"]  # model ID
    RATE = json.loads(config_data["config"]["rate"])  # the QPS rates to try
    DURATION = str(config_data["config"]["duration"])  # duration for which tests will be run
    PROJECT = config_data["config"]["project"]  # project ID
    LOCATION = config_data["config"]["location"]  # region
    # seconds to wait after deployment; time.sleep() needs a number, not a string
    TIMEOUT = int(config_data["config"]["timeout"])
    MIN_NODES = int(config_data["config"]["min_nodes"])  # min nodes for scaling
    MAX_NODES = int(config_data["config"]["max_nodes"])  # max nodes for scaling
    REQUEST_FILE = str(config_data["config"]["request_file"])

    # default value, so the return below is defined even if an exception is caught
    results = []

    # deploy model on endpoint.
    logging.info(
        "Deploying endpoint on machine: %s for model: %s", machine_type, MODEL_ID)
    try:
        # create client for Vertex AI.
        logging.info("Creating AI Platform object.")
        aiplatform.init(project=PROJECT, location=LOCATION)

        # load the model from registry.
        logging.info("Loading {} from Model registry.".format(MODEL_ID))
        model = aiplatform.Model(model_name=MODEL_ID)

        # generate random UUID
        logging.info("Generating random UUID for endpoint creation.")
        ep_uuid = uuid.uuid4().hex
        display_name = f"ep_{machine_type}_{ep_uuid}"

        # create endpoint instance
        logging.info("Creating endpoint instance.")
        endpoint = aiplatform.Endpoint.create(display_name=display_name)

        # deploy endpoint on specific machine type
        logging.info("Deploying model {} on endpoint {}".format(model, display_name))
        endpoint.deploy(model, min_replica_count=MIN_NODES,
                        max_replica_count=MAX_NODES, machine_type=machine_type)

        # Sleep for the configured timeout (5 minutes by default);
        # general best practice with Vertex AI Endpoints
        logging.info("Sleeping for %s seconds, for the endpoint to be ready!", TIMEOUT)
        time.sleep(TIMEOUT)

        # Register latencies for predictions
        logging.info("Calling utility to register the latencies.")
        ret_code, latencies = utils.register_latencies(
            RATE, DURATION, endpoint, machine_type, endpoint.display_name,
            latencies, REQUEST_FILE, log_level)
        if ret_code == 1:
            logging.info("Latencies recorded for {}".format(machine_type))
        else:
            logging.error("Error in recording latencies for {}".format(machine_type))
            sys.exit(1)

        # preprocess registered latencies
        logging.info("Calling utility to prepare latencies for BigQuery.")
        results = utils.log_latencies_to_bq(MODEL_ID, latencies, log_level)
        if results:
            logging.info("Latencies information processed successfully.")
        else:
            logging.error("Error in recording all latencies. Exiting.")
            sys.exit(1)

        # Un-deploy endpoint
        logging.info("Un-deploying endpoint: %s", endpoint.resource_name)
        endpoint.undeploy_all()

        # Deleting endpoint
        logging.info("Deleting endpoint: %s", endpoint.resource_name)
        endpoint.delete()

        logging.info("Processing completed for machine: %s", machine_type)

    except Exception:
        # log the full traceback; format_exc() works across Python 3 versions,
        # unlike format_exception(etype=...), whose keyword was removed in 3.10
        logging.error(traceback.format_exc())

    # return results.
    return results

# entrypoint function.
def main():
    """ Entrypoint """

    # Read config.
    config_data = utils.read_config("config/config.ini")
    # List of machine types (whitespace around each entry is ignored)
    MACHINE_TYPES_LST = [m.strip() for m in config_data["config"]["machine_types_lst"].split(',')]
    LOG_LEVEL = config_data["config"]["log_level"]  # level of logging.
    OUTPUT_BQ_TBL_ID = config_data["config"]["output_bq_tbl_id"]  # BigQuery table to store results
    PROJECT = config_data["config"]["project"]  # project ID

    # log setup.
    logging.basicConfig(level=LOG_LEVEL, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")

    # start logging.
    logging.info("Vertex Endpoint Stress Tester Utility.")

    # variables
    logging.info("Prepping local variables.")
    LATENCIES = []
    RESULTS = []

    # record start time.
    start = time.time()

    # loop through each machine type
    # and process the records.
    try:
        for machine_type in MACHINE_TYPES_LST:
            # log calling the utility
            logging.info("Calling data processing utility.")

            # append the results from utility
            RESULTS.extend(process(machine_type, LATENCIES, LOG_LEVEL))

            # log end.
            logging.info("Results utility completed.")

            # reset the latencies variable
            LATENCIES = []
    except Exception as e:
        # log error
        logging.error("Got error while running load tests.")
        logging.error(e)
        # exit
        sys.exit(1)

    # REMOVE
    logging.info(len(LATENCIES))
    logging.info(len(RESULTS))

    # write collected results to BigQuery
    # (results cover all machine types, not only the last one in the loop)
    logging.info("Writing load testing data for machine types: %s", ", ".join(MACHINE_TYPES_LST))
    bq_write_ret_code = utils.write_results_to_bq(RESULTS, OUTPUT_BQ_TBL_ID, PROJECT, LOG_LEVEL)
    if bq_write_ret_code == 1:
        # log success
        logging.info("Successfully written data into BQ in {} table.".format(OUTPUT_BQ_TBL_ID))
    else:
        # log error
        logging.error("Errors in writing data into BigQuery. Exiting.")
        # exit
        sys.exit(1)

    # print the total time taken.
    # this is for all machines.
    logging.info(f"Total time taken for execution {time.time()-start}")

# Call entrypoint
if __name__ == "__main__":
    main()

# End.
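
One design point worth illustrating: in `process`, the endpoint is un-deployed and deleted only when every earlier step succeeds, so a failure during the load test can leave a billable endpoint behind. The sketch below shows a `try`/`finally` pattern that always runs the cleanup. `FakeEndpoint`, `run_with_cleanup`, and `failing_test` are hypothetical stand-ins so the example runs without Google Cloud credentials; only the method names `undeploy_all` and `delete` mirror the calls made in `main.py`.

```python
class FakeEndpoint:
    """Hypothetical stand-in for a deployed endpoint; records cleanup calls."""

    def __init__(self):
        self.calls = []

    def undeploy_all(self):
        self.calls.append("undeploy_all")

    def delete(self):
        self.calls.append("delete")


def run_with_cleanup(endpoint, load_test):
    """Run load_test(endpoint) and always tear the endpoint down afterwards."""
    try:
        return load_test(endpoint)
    finally:
        # Runs on success, on exceptions, and on sys.exit().
        endpoint.undeploy_all()
        endpoint.delete()


def failing_test(endpoint):
    """Simulate a load test that fails part-way through."""
    raise RuntimeError("simulated load-test failure")


endpoint = FakeEndpoint()
try:
    run_with_cleanup(endpoint, failing_test)
except RuntimeError as err:
    print("load test failed:", err)
print(endpoint.calls)
```

With this pattern the error still propagates to the caller, but the endpoint is removed first.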
20 changes: 20 additions & 0 deletions tools/vertex-ai-endpoint-load-tester/requests/request_movie.json
@@ -0,0 +1,20 @@
{
  "instances": [
    {
      "Id": 3837,
      "name": "The",
      "rating": "R",
      "genre": "Comedy",
      "year": 2000,
      "released": "8/3/2001",
      "director": "John",
      "writer": "John",
      "star": "Michael",
      "country": "United",
      "budget": 35524924.14,
      "company": "Pictures",
      "runtime": 104,
      "data_cat": "TRAIN"
    }
  ]
}