The following is a brief walkthrough of using go-infer to quickly build and deploy a model inference API. For details, please refer to the complete examples in the examples directory.
The request processing flow is shown in the figure below. The go-infer framework hides common logic such as concurrent API request handling, serialization and queuing of inference requests, and the Dispatcher distribution service. When developing an API, users only need to implement the handling of the API request parameters (the API entry function) and the model inference part (the model inference function).
In the code implementation, the framework defines a simplified model interface. As long as the user implements its methods, the framework takes care of the process shown in the figure above. The model interface is defined as follows:
// Model interface definition
type Model interface {
    ApiPath() string                                                         // HTTP URL path
    ApiEntry(*map[string]interface{}) (*map[string]interface{}, error)       // Process of handling API parameters
    Init() error                                                             // Model initialization, loading weights, etc.
    Infer(string, *map[string]interface{}) (*map[string]interface{}, error)  // The process of model inference
}
- ApiPath() is simple: it returns the URL path string of the API. When the HTTP server starts, the URL service is registered under this path.
- ApiEntry() processes the parameters passed in by the API, typically validating them according to business logic. Its input is a key-value map containing the content of the data field passed in by the API (for the API input parameter structure, please refer to the API Document Template), and its return value is also a key-value map containing the content passed on to the inference function.
- Init() loads the model weights and performs other model initialization work. It is called when the Dispatcher server starts.
- Infer() implements the actual inference service. Its input parameters are the requestId and the parameter data produced by ApiEntry(), and its output is the content returned in the data field of the API response.
For a concrete example, please refer to the code example.
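As an illustration only, a minimal sketch of an implementation is shown below; the EchoModel type, its URL path and its field names are hypothetical placeholders rather than code taken from the examples directory.

// A minimal, hypothetical implementation of the Model interface.
package embedding

import (
    "errors"
)

// EchoModel is a placeholder model used only to illustrate the interface.
type EchoModel struct{}

// ApiPath returns the URL path under which the HTTP server registers this API.
func (m *EchoModel) ApiPath() string {
    return "/api/echo"
}

// ApiEntry validates the fields of the API "data" payload and returns the
// parameters that will be queued for inference.
func (m *EchoModel) ApiEntry(reqData *map[string]interface{}) (*map[string]interface{}, error) {
    text, ok := (*reqData)["text"].(string)
    if !ok || text == "" {
        return nil, errors.New("missing or invalid field: text")
    }
    params := map[string]interface{}{"text": text}
    return &params, nil
}

// Init loads model weights and performs other initialization; it is called
// when the Dispatcher server starts.
func (m *EchoModel) Init() error {
    // Load weights, sessions, vocabularies, etc. here.
    return nil
}

// Infer runs the actual inference; the returned map becomes the "data" field
// of the API response.
func (m *EchoModel) Infer(requestId string, data *map[string]interface{}) (*map[string]interface{}, error) {
    result := map[string]interface{}{
        "requestId": requestId,
        "echo":      (*data)["text"],
    }
    return &result, nil
}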
The framework provides command-line commands that can be integrated into the user's own command-line program:
// Add model instance
types.ModelList = append(types.ModelList, &embedding.BertEMB{})
// Command line settings
rootCmd.AddCommand(cli.HttpCmd)
rootCmd.AddCommand(cli.ServerCmd)
Before adding the commands, append the model implemented above to the framework's ModelList. The framework will initialize and register it, and look up the corresponding model when processing requests.
For details, please refer to the code example.
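As a rough sketch of how this wiring might look in a main package (it assumes a cobra-style rootCmd, as suggested by the snippet above; the import paths are placeholders, not the framework's actual module path):

package main

import (
    "github.com/spf13/cobra"

    // Placeholder import paths: replace with the actual go-infer module path
    // and your own model package.
    "go-infer/cli"
    "go-infer/types"
    "myproject/embedding"
)

// Root command of the user's CLI; HttpCmd and ServerCmd are added as subcommands.
var rootCmd = &cobra.Command{Use: "go-embedding"}

func main() {
    // Register the model implementation so the framework can initialize it
    // and locate it when processing requests.
    types.ModelList = append(types.ModelList, &embedding.BertEMB{})

    // Add the framework's HTTP server and Dispatcher server commands.
    rootCmd.AddCommand(cli.HttpCmd)
    rootCmd.AddCommand(cli.ServerCmd)

    if err := rootCmd.Execute(); err != nil {
        panic(err)
    }
}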
For the full content, please refer to the configuration file example.
The configuration file path defaults to config/settings.yaml. You can use --yaml in the server and http command parameters to specify a different configuration file path.
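For example (the path below is only illustrative):
build/go-embedding http --yaml /data/goinfer/settings.yaml
build/go-embedding server 0 --yaml /data/goinfer/settings.yaml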
To export the TensorFlow BERT model, please refer to export_tf_bert.py.
To export the Keras CNN model, please refer to export_keras_cnn.py.
Compiling
cd examples
make
Start the Dispatcher and inference service
build/go-embedding server 0
Start HTTP API service
build/go-embedding http
API testing
python3 test_api localhost mobile
The go-infer framework already handles concurrent API processing (the HTTP server) and the serialized execution of the inference module (the Dispatcher server). Therefore, in actual deployment the main consideration is the processing capacity during peak concurrency. When peak concurrency is low, you can choose stand-alone deployment, that is, deploy the HTTP server and the Dispatcher server on the same physical server. When peak concurrency grows, there are three main options for improving concurrent processing capacity (only CPU deployment is considered here):
- Increase the computing power of a single server.
- Deploy the HTTP server, the Dispatcher server and redis on three different servers.
- Based on option 2, analyze where the computing power bottleneck is and horizontally scale the HTTP servers and Dispatcher servers accordingly.
Here we take option 3 as an example for demonstration. Please refer to the following figure for the deployment architecture:
First, assume that the deployment environment is as follows:
- One nginx server (192.168.0.100) serves as the API request entry and performs load balancing
- One redis server (192.168.0.101) (Considering system stability, a redis cluster can be deployed. Please refer to the redis documentation for details)
- Two HTTP servers (192.168.0.102, 192.168.0.103) provide API processing services
- Four Dispatcher servers (192.168.0.104, 192.168.0.105, 192.168.0.106, 192.168.0.107)
To distribute API requests evenly across the inference servers, two redis queues are created: each HTTP server is associated with one queue, and each queue is consumed by two Dispatcher servers on the backend. In this way, the concurrent requests of each HTTP server are processed by two Dispatcher servers. In addition, each Dispatcher server sets the MaxWorkers parameter according to its number of CPU cores (the servers are assumed to have 8 cores here).
Note: the following are only fragments of the configuration files; the remaining content needs to be completed.
nginx configuration at 192.168.0.100
upstream goinfer {
    least_conn;
    server 192.168.0.102:5000;
    server 192.168.0.103:5000;
}
server {
    listen 5000;
    location / {
        proxy_pass http://goinfer;
    }
}
redis configuration at 192.168.0.101
bind 192.168.0.101
port 7480
requirepass e18ffb7484f4d69c2acb40008471a71c
client-output-buffer-limit pubsub 32mb 8mb 60
settings.yaml (fragment common to the HTTP and Dispatcher servers)
# HTTP server parameters
API:
  Port: 5000
  Addr: 0.0.0.0
  SM2PrivateKey: "JShsBOJL0RgPAoPttEB1hgtPAvCikOl0V1oTOYL7k5U="  # SM2 private key
  AppIdSecret: {  # The appid and secret assigned for API calls
    "3EA25569454745D01219080B779F021F" : "41DF0E6AE27B5282C07EF5124642A352",
  }
# Parameters for the inference service queue
Server:
  RedisServer: "192.168.0.101:7480"  # The redis server deployed at 192.168.0.101
  RedisPasswd: "e18ffb7484f4d69c2acb40008471a71c"
  MessageTimeout: 10  # Maximum waiting time for inference calls
  MaxWorkers: 8  # Maximum number of concurrent inferences (recommended to be the same as the number of CPU cores)
settings.yaml at 192.168.0.102
Server:
  QueueName: "goinfer-synchronous-asynchronous-queue_102"
settings.yaml at 192.168.0.103
Server:
  QueueName: "goinfer-synchronous-asynchronous-queue_103"
Start command
build/go-embedding http
settings.yaml at the first Dispatcher server for queue _102
Server:
  QueueName: "goinfer-synchronous-asynchronous-queue_102"
Start command
build/go-embedding server 0
settings.yaml at the second Dispatcher server for queue _102
Server:
  QueueName: "goinfer-synchronous-asynchronous-queue_102"
Start command
build/go-embedding server 1
settings.yaml at the first Dispatcher server for queue _103
Server:
  QueueName: "goinfer-synchronous-asynchronous-queue_103"
Start command
build/go-embedding server 0
settings.yaml at the second Dispatcher server for queue _103
Server:
  QueueName: "goinfer-synchronous-asynchronous-queue_103"
Start command
build/go-embedding server 1
python3 test_api "192.168.0.100" mobile