Describe the bug
The model can no longer be loaded in the SageMaker endpoint after the SageMaker SDK in the container is updated to 2.212.
To reproduce
# Imports as used with recent SageMaker SDK versions; InputTranslator and the
# InferenceSpec passed below are custom classes from our code (definitions omitted).
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.serve.mode.function_pointers import Mode

model_builder = ModelBuilder(
    model_path=model_path,
    schema_builder=SchemaBuilder(sample_input, sample_output, input_translator=InputTranslator()),
    content_type='application/x-image',
    mode=Mode.SAGEMAKER_ENDPOINT,
    role_arn=role_arn,
    image_uri=image,
    inference_spec=InferenceSpec(),
)
built_model = model_builder.build()
built_model.deploy(
    instance_type="ml.c6i.2xlarge",
    endpoint_name="my_endpoint_name",
    initial_instance_count=1,
)
Expected behavior
- By default, ModelBuilder pins the SDK version in the generated requirements.txt with == rather than >=.
- The model can be loaded with SageMaker SDK 2.212.
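For illustration, the generated requirements.txt currently contains an open-ended constraint where a pinned one is expected (other entries omitted; the full file is not shown in this report):

# current (problematic)
sagemaker>=2.199
# expected
sagemaker==2.199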
Screenshots or logs
2024-03-07T10:23:04.572+01:00 Model server started.
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,338 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=64
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,341 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]64
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,357 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,366 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-03-07T10:23:04.572+01:00 2024-03-07T09:23:04,371 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1709803384370
2024-03-07T10:23:05.324+01:00 2024-03-07T09:23:04,409 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-03-07T10:23:05.324+01:00 2024-03-07T09:23:05,201 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,202 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 184, in handle_connection
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,555 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,556 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-03-07T10:23:05.575+01:00 2024-03-07T09:23:05,556 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.212
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch
- Framework version: -
- Python version: 3.10
- CPU or GPU: GPU and CPU
- Custom Docker image (Y/N): N
Additional context
The SageMaker endpoint had been working for a while and was successfully processing requests. The endpoint then restarted and installed the latest version of the SageMaker SDK (2.212), after which it stopped processing requests and produced the logs above. I noticed that ModelBuilder creates a model package containing a requirements.txt, and that file specifies sagemaker>=2.199. I modified it to pin sagemaker==2.199, which solved the issue.
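As a rough way to check which constraint was packaged, the requirements.txt can be read out of the uploaded model artifact (a minimal sketch, assuming the file generated by ModelBuilder ends up inside the model.tar.gz in S3; the bucket and key below are placeholders):

import tarfile
import boto3

s3 = boto3.client("s3")
s3.download_file("my-bucket", "path/to/model.tar.gz", "model.tar.gz")

with tarfile.open("model.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        if member.name.endswith("requirements.txt"):
            # Print the packaged dependency constraints, e.g. "sagemaker>=2.199"
            print(tar.extractfile(member).read().decode())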