Vertex AI Endpoint Stress Tester (#1336)

* Updated test files names for issue #1169
* Vertex AI Endpoint Stress Tester utility: First push
* Updated the vegata script as per Trigger build errors
* Further fixes for build failures
* Further fixes for build failures
* Updated README.md

---------

Co-authored-by: Andrew Gold <[email protected]>
GoogleCloudPlatform · Aug 16, 2024 · 8c000a0
1 parent 91bb52d
commit 8c000a0

Showing 11 changed files with 673 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -564,6 +564,10 @@ Platform usage.
 *   [STS Job Manager](tools/sts-job-manager/) - A petabyte-scale bucket
     migration tool utilizing
     [Storage Transfer Service](https://cloud.google.com/storage-transfer-service)
+*   [Vertex AI Endpoint Tester](tools/vertex-ai-endpoint-load-tester) - This
+    utility methodically tests Vertex AI Endpoints of various machine sizes so
+    that you can pick the right size for deploying an ML model on Vertex AI,
+    given a sample request JSON and an estimate of expected queries per second.
 *   [VM Migrator](tools/vm-migrator) - This utility automates migrating Virtual
     Machine instances within GCP. You can migrate VM's from one zone to another
     zone/region within the same project or different projects while retaining

diff --git a/tools/vertex-ai-endpoint-load-tester/README.md b/tools/vertex-ai-endpoint-load-tester/README.md
@@ -0,0 +1,79 @@
+```
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+```
+
+# Vertex AI Endpoint Stress Tester
+
+go/vertex-endpoint-stress-tester
+
+## Introduction
+
+Vertex AI Endpoints are a managed solution for deploying ML models at scale. Architecturally, Vertex AI Endpoints use GKE or similar infrastructure components in the background to enable seamless deployment and inference for any ML model, whether AutoML or custom.
+
+In some of our recent engagements, questions have been raised about the scalability of Vertex AI Endpoints. A sample notebook available on GitHub under the Google Cloud Platform account explains one of the many ways to check how much load a particular instance can handle. However, it is not an automated solution that anyone from GCC can use with ease, and it involves tedious manual work: creating and deleting endpoints and deploying ML models on them to test the load that a specific type of VM can handle. Given that the Vertex AI endpoint service continues to grow and supports a variety of instance types, this procedure needs improvement, so that anyone from GCC can easily deploy a given ML model on a series of endpoints of various sizes and check which one best suits the workload, given an estimate of how much traffic the model is expected to receive once it goes to production.
+
+This is where our automated tool comes in (proposed to be open sourced in the PSO GitHub and KitHub). Its objective is to automatically stress test one particular model across various endpoint configurations, with and without autoscaling, so that the right sizing of the endpoint can be decided with a data-driven approach.
+
+## Assumptions
+
+1. That the ML model is already built. This automation tool does not train it; it simply references the model from BQML or the Vertex AI Model Registry.
+2. That the deployed ML model can accept a valid JSON request as input and provide online predictions as an output, preferably JSON.
+3. That the user of this utility has at least an example JSON request file, put into the [requests](requests/) folder. Please see the existing [example](requests/request_movie.json) for clarity.
+
+## How to Install & Run?
+
+Out of the box, the utility can be run from the command line, so the best way to try it for the first time is to:
+
+1. Edit the [config](config/config.ini) file and select only 1 or 2 VM types.
+2. Place the request JSON file into the [requests](requests/) folder. Please see the existing [example](requests/request_movie.json) for reference.
+3. Run the utility as follows:
+
+
+```
+cd vertex-ai-endpoint-load-tester/
+gcloud auth login
+gcloud config set project PROJECT_ID 
+python main.py
+```
+
+## Logging
+
+When run from the command line, all logs are printed to the console for the user to validate. They are NOT stored anywhere else for historical reference.
+We therefore recommend installing this solution as a container on Cloud Run and running it as a Cloud Run service or job (as applicable), so that all logs can be found in Cloud Logging.
+
+## Reporting/Analytics
+
+TODO: This feature is still open and will be added shortly.
+The idea is to use a Looker Studio dashboard to visualize the load-testing results so that they are easy for anyone to consume.
+
+## Troubleshooting
+
+1. Check that the user or service account running the job (on Cloud Run, for example) has the requisite IAM permissions.
+2. Ensure the [config](config/config.ini) file has no typos or extraneous content.
+3. Check the logs for any specific errors that can help debug further.
+
+## Known Errors
+
+TODO
+
+## Roadmap
+
+In the future, we aim to extend this utility to LLMs and other types of ML models.
+Further, the same approach could be extended to load test other GCP services, such as GKE, that are frequently used to deploy ML solutions.
+
+## Authors:
+
+Ajit Sonawane - AI Engineer, Google Cloud
+Suddhasatwa Bhaumik - AI Engineer, Google Cloud
diff --git a/tools/vertex-ai-endpoint-load-tester/config/config.ini b/tools/vertex-ai-endpoint-load-tester/config/config.ini
@@ -0,0 +1,64 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Input configurations
+[config]
+
+# logging level
+log_level = INFO
+
+# deployed model ID
+MODEL_ID = 888526341522063360
+
+# the QPS rates to try
+RATE = [25, 50] 
+
+# duration for which tests will be run
+DURATION = 10
+
+# BigQuery table to store results
+OUTPUT_BQ_TBL_ID = load_test_dataset.test9
+
+# project ID 
+PROJECT = rare-signer-355918
+
+# region
+LOCATION = us-central1
+
+# amount of sleep time, in seconds,
+# before the endpoint is tested after
+# the model is deployed
+TIMEOUT = 300
+
+# autoscaling details.
+MIN_NODES = 1
+MAX_NODES = 2
+
+# Machine types to be used
+# during testing, given as a
+# comma-separated list without quotes or spaces
+MACHINE_TYPES_LST = n1-standard-4,n1-standard-8
+
+# name of the request body file in the requests folder, used for the POST calls made during stress testing
+# Please do not enclose the file name in quotes
+REQUEST_FILE = request_movie.json
+
+# Other candidate machine types (commented out, not read by the tool):
+#
+# n1-standard-4, n1-standard-8, n1-standard-16, n1-standard-32, n1-standard-64,
+# n1-highmem-2, n1-highmem-4, n1-highmem-8, n1-highmem-16, n1-highmem-32,
+# n1-highcpu-2, n1-highcpu-4, n1-highcpu-8, n1-highcpu-16, n1-highcpu-32,
+# c3-standard-4, c3-standard-8, c3-standard-22, c3-standard-44, c3-standard-88, c3-standard-176
+
+# End.
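
A note on how these values are read: `utils.read_config` is not shown in this part of the diff, so the sketch below assumes it wraps Python's standard `configparser`. Under that assumption, option names are lowercased on read (which is why `main.py` looks up `model_id` rather than `MODEL_ID`), and every value arrives as a string that must be converted explicitly. The `parse_settings` helper and the trimmed `SAMPLE` text are hypothetical and only illustrate the conversions.

```python
import configparser
import json

# Trimmed copy of a few keys from config.ini, used only for illustration.
SAMPLE = """
[config]
log_level = INFO
RATE = [25, 50]
DURATION = 10
TIMEOUT = 300
MIN_NODES = 1
MAX_NODES = 2
MACHINE_TYPES_LST = n1-standard-4,n1-standard-8
REQUEST_FILE = request_movie.json
"""


def parse_settings(text: str) -> dict:
    """Hypothetical helper: parse the [config] section into typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["config"]  # option names are lowercased by configparser
    return {
        # RATE is written as a JSON list, so json.loads turns it into ints.
        "rate": json.loads(section["rate"]),
        # All values are strings unless converted; getint does the cast.
        "timeout": section.getint("timeout"),
        "min_nodes": section.getint("min_nodes"),
        "max_nodes": section.getint("max_nodes"),
        # Comma-separated list; strip stray spaces and drop empty entries.
        "machine_types": [
            m.strip() for m in section["machine_types_lst"].split(",") if m.strip()
        ],
        "request_file": section["request_file"],
    }


settings = parse_settings(SAMPLE)
print(settings)
```

Stripping each machine type guards against a value such as `n1-standard-4, n1-standard-8`, where a plain `split(',')` would keep the leading space in the second entry.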
diff --git a/tools/vertex-ai-endpoint-load-tester/extras/vegeta_12.8.4_linux_amd64.tar.gz b/tools/vertex-ai-endpoint-load-tester/extras/vegeta_12.8.4_linux_amd64.tar.gz
diff --git a/tools/vertex-ai-endpoint-load-tester/main.py b/tools/vertex-ai-endpoint-load-tester/main.py
@@ -0,0 +1,211 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#
+# Script deploys a Vertex AI endpoint
+# and captures endpoint performance to BQ
+#
+# Authors: ajitsonawane@,suddhasatwa@
+# Team:    Google Cloud Consulting
+# Date:    25.01.2024
+
+# Imports
+import sys
+import logging
+import traceback
+import uuid
+import time
+import json
+from google.cloud import aiplatform
+
+from utils import utils
+# from utils import config_parser as cfp
+# from utils.utils import register_latency
+# from utils.utils import log_latencies_to_bq
+# from utils.utils import write_results_to_bq
+
+# function to process requests to endpoint.
+def process(machine_type: str, latencies: list, log_level: str):
+    """
+    Deploys the model on an endpoint of the given machine type and measures latencies.
+    Takes the latencies list as input.
+    Calls the Vegeta utility to update latencies for each machine type.
+    Passes them to another utility to generate the full results.
+    Returns the results.
+
+    Inputs:
+        machine_type: each type of machine to be tested.
+        latencies: list (usually empty) to get results from Vegeta
+        log_level: level of logging.
+
+    Outputs:
+        results: Combined results for each machine type.
+    """
+
+    # set logging setup
+    logging.basicConfig(level=log_level, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
+
+    # start logging.
+    logging.info("Reading configuration.")
+
+    # read config.
+    config_data = utils.read_config("config/config.ini")
+    MODEL_ID = config_data["config"]["model_id"] # model ID
+    RATE = json.loads(config_data["config"]["rate"]) # the QPS rates to try
+    DURATION = str(config_data["config"]["duration"]) # duration for which tests will be run
+    PROJECT = config_data["config"]["project"] # project ID
+    LOCATION = config_data["config"]["location"] # region
+    TIMEOUT = int(config_data["config"]["timeout"]) # seconds to sleep after deployment (time.sleep needs a number)
+    MIN_NODES = int(config_data["config"]["min_nodes"]) # min nodes for scaling
+    MAX_NODES = int(config_data["config"]["max_nodes"]) #max nodes for scaling
+    REQUEST_FILE = str(config_data["config"]["request_file"])
+
+    # deploy model on endpoint; results stays empty if an exception is logged below.
+    results = []
+    logging.info("Deploying endpoint on machine: %s for model: %s", machine_type, MODEL_ID)
+    try:
+        # create client for Vertex AI. 
+        logging.info("Creating AI Platform object.")
+        aiplatform.init(project=PROJECT, location=LOCATION)
+
+        # load the model from registry.
+        logging.info("Loading {} from Model registry.".format(MODEL_ID))
+        model = aiplatform.Model(model_name=MODEL_ID)
+
+        # generate random UUID
+        logging.info("Generating random UUID for endpoint creation.")
+        ep_uuid = uuid.uuid4().hex
+        display_name = f"ep_{machine_type}_{ep_uuid}"
+
+        # create endpoint instance
+        logging.info("Creating endpoint instance.")
+        endpoint = aiplatform.Endpoint.create(display_name=display_name)
+
+        # deploy endpoint on specific machine type
+        logging.info("Deploying model {} on endpoint {}".format(model, display_name))
+        endpoint.deploy(model, min_replica_count=MIN_NODES,
+                        max_replica_count=MAX_NODES, machine_type=machine_type)
+
+        # Sleep for TIMEOUT seconds (5 minutes with the default config);
+        # general best practice with Vertex AI Endpoints
+        logging.info("Sleeping for %s seconds, for the endpoint to be ready!", TIMEOUT)
+        time.sleep(TIMEOUT)
+
+        # Register latencies for predictions
+        logging.info("Calling utility to register the latencies.")
+        ret_code, latencies = utils.register_latencies(RATE, DURATION, endpoint, machine_type, endpoint.display_name, latencies, REQUEST_FILE, log_level)
+        if ret_code == 1:
+            logging.info("Latencies recorded for {}".format(machine_type))
+        else:
+            logging.error("Error in recording latencies for {}".format(machine_type))
+            sys.exit(1)
+
+        # preprocess registered latencies
+        logging.info("Calling utility to prepare latencies for BigQuery.")
+        results = utils.log_latencies_to_bq(MODEL_ID, latencies, log_level)
+        if results:
+            logging.info("Latencies information processed successfully.")
+        else:
+            logging.error("Error in recording all latencies. Exiting.")
+            sys.exit(1)
+
+        # Un-deploy endpoint
+        logging.info("Un-deploying endpoint: %s", endpoint.resource_name)
+        endpoint.undeploy_all()
+
+        # Deleting endpoint
+        logging.info("Deleting endpoint: %s", endpoint.resource_name)
+        endpoint.delete()
+
+        logging.info("Processing completed for machine: %s", machine_type)
+
+    except Exception as ex:
+        logging.error(''.join(traceback.format_exception(type(ex),
+                                                         ex, ex.__traceback__)))
+
+    # return results.
+    return results
+
+# entrypoint function.
+def main():
+    """ Entrypoint """
+
+    # Read config. 
+    config_data = utils.read_config("config/config.ini")
+    MACHINE_TYPES_LST = config_data["config"]["machine_types_lst"].split(',') # List of machine types
+    LOG_LEVEL = config_data["config"]["log_level"] # level of logging.
+    OUTPUT_BQ_TBL_ID = config_data["config"]["output_bq_tbl_id"] # BigQuery table to store results
+    PROJECT = config_data["config"]["project"] # project ID 
+
+    # log setup.
+    logging.basicConfig(level=LOG_LEVEL, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
+
+    # start logging.
+    logging.info("Vertex Endpoint Stress Tester Utility.")
+
+    # variables
+    logging.info("Prepping local variables.")
+    LATENCIES = []
+    RESULTS = []
+
+    # record start time.
+    start = time.time()
+
+    # loop through each machine type
+    # and process the records.
+    try:
+        for machine_type in MACHINE_TYPES_LST:
+            # log calling the utility
+            logging.info("Calling data processing utility.")
+
+            # append the results from utility
+            RESULTS.extend(process(machine_type, LATENCIES, LOG_LEVEL))
+
+            # log end. 
+            logging.info("Results utility completed.")
+
+            # reset the latencies variable
+            LATENCIES = []
+    except Exception as e:
+        # log error
+        logging.error("Got error while running load tests.")
+        logging.error(e)
+        # exit
+        sys.exit(1) 
+
+    # REMOVE
+    logging.info(len(LATENCIES))
+    logging.info(len(RESULTS))
+
+    # write collected results to BigQuery
+    logging.info("Writing load testing results for machine types: %s", ", ".join(MACHINE_TYPES_LST))
+    bq_write_ret_code = utils.write_results_to_bq(RESULTS, OUTPUT_BQ_TBL_ID, PROJECT, LOG_LEVEL)
+    if bq_write_ret_code == 1:
+        # log success
+        logging.info("Successfully written data into BQ in {} table.".format(OUTPUT_BQ_TBL_ID))
+    else:
+        # log error
+        logging.error("Errors in writing data into BigQuery. Exiting.")
+        # exit
+        sys.exit(1)
+
+    # print the total time taken.
+    # this is for all machines.
+    logging.info(f"Total time taken for execution {time.time()-start}")
+
+# Call entrypoint
+if __name__ == "__main__":
+    main()
+
+# End.
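
One observation on `process()` as committed: the un-deploy and delete calls sit inside the `try` block after the latency steps, so an exception (or one of the `sys.exit(1)` calls) before them leaves the endpoint deployed. A minimal sketch of one way to guarantee cleanup with `try`/`finally` follows. `FakeEndpoint`, `run_with_cleanup`, and the dummy result rows are hypothetical stand-ins so the pattern can run without a GCP project; only the `undeploy_all()` and `delete()` method names come from the script above.

```python
import logging


class FakeEndpoint:
    """Hypothetical stand-in for the deployed endpoint; it only records calls."""

    def __init__(self):
        self.calls = []

    def undeploy_all(self):
        self.calls.append("undeploy_all")

    def delete(self):
        self.calls.append("delete")


def run_with_cleanup(endpoint, load_test):
    """Run load_test(endpoint) and always tear the endpoint down afterwards."""
    results = []  # returned unchanged if the load test fails
    try:
        results = load_test(endpoint)
    except Exception:
        # logging.exception records the message plus the full traceback.
        logging.exception("Load test failed.")
    finally:
        # Runs on success, on failure, and on SystemExit raised by sys.exit().
        endpoint.undeploy_all()
        endpoint.delete()
    return results


def failing_test(_endpoint):
    raise RuntimeError("simulated failure")


ok_endpoint = FakeEndpoint()
ok_results = run_with_cleanup(ok_endpoint, lambda ep: [{"latency": 0.1}])  # dummy row

bad_endpoint = FakeEndpoint()
bad_results = run_with_cleanup(bad_endpoint, failing_test)
```

In both runs the endpoint records `undeploy_all` followed by `delete`, and the failing run returns an empty list instead of raising.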
diff --git a/tools/vertex-ai-endpoint-load-tester/requests/request_movie.json b/tools/vertex-ai-endpoint-load-tester/requests/request_movie.json
@@ -0,0 +1,20 @@
+{
+  "instances": [
+    {
+      "Id": 3837,
+      "name": "The",
+      "rating": "R",
+      "genre": "Comedy",
+      "year": 2000,
+      "released": "8/3/2001",
+      "director": "John",
+      "writer": "John",
+      "star": "Michael",
+      "country": "United",
+      "budget": 35524924.14,
+      "company": "Pictures",
+      "runtime": 104,
+      "data_cat": "TRAIN"
+    }
+  ]
+}
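
Since a malformed request file would only surface as failed calls after an endpoint has already been deployed, it may be worth checking the file up front. The sketch below assumes the model expects the usual Vertex AI online prediction body, a JSON object with a non-empty `instances` list as in the example above. `validate_request_body` is a hypothetical helper and is not part of this commit.

```python
import json


def validate_request_body(raw: str) -> dict:
    """Hypothetical pre-flight check for a request file's contents."""
    body = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    instances = body.get("instances") if isinstance(body, dict) else None
    if not isinstance(instances, list) or not instances:
        raise ValueError('request body must contain a non-empty "instances" list')
    return body


# Shortened version of request_movie.json, used only for illustration.
sample = '{"instances": [{"Id": 3837, "genre": "Comedy", "runtime": 104}]}'
validated = validate_request_body(sample)
print(len(validated["instances"]))
```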