
Model cannot be loaded in the SageMaker endpoint after update of SageMaker SDK to 2.212 #4488

Open
@Neptun332

Description


Describe the bug
The model cannot be loaded in the SageMaker endpoint after the SageMaker SDK was updated to 2.212.

To reproduce

# Imports assumed to come from the sagemaker.serve package; InputTranslator and
# the concrete InferenceSpec subclass are user-defined and not shown here.
from sagemaker.serve import ModelBuilder, SchemaBuilder, Mode, InferenceSpec

model_builder = ModelBuilder(
    model_path=model_path,
    schema_builder=SchemaBuilder(sample_input, sample_output, input_translator=InputTranslator()),
    content_type='application/x-image',
    mode=Mode.SAGEMAKER_ENDPOINT,
    role_arn=role_arn,
    image_uri=image,
    inference_spec=InferenceSpec()
)
built_model = model_builder.build()
built_model.deploy(
    instance_type="ml.c6i.2xlarge",
    endpoint_name="my_endpoint_name",
    initial_instance_count=1,
)

Expected behavior

  • By default, ModelBuilder should pin the sagemaker dependency with == rather than >= in the generated requirements.txt
  • The model can be loaded with SageMaker SDK 2.212
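
For illustration only (the surrounding entries and exact file path are hypothetical), the difference between the specifier the generated requirements.txt contains today and the exact pin expected above:

```
# code/requirements.txt generated by ModelBuilder
sagemaker>=2.199    # current: silently picks up the latest SDK on restart
sagemaker==2.199    # expected: the version the package was built against
```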

Screenshots or logs

2024-03-07T10:23:04.572+01:00	Model server started.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,338 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=64
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,341 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]64
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,357 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,366 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,371 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1709803384370
2024-03-07T10:23:05.324+01:00	2024-03-07T09:23:04,409 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-03-07T10:23:05.324+01:00	2024-03-07T09:23:05,201 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,202 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 184, in handle_connection
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,556 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,556 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.

System information

  • SageMaker Python SDK version: 2.212
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: -
  • Python version: 3.10
  • CPU or GPU: GPU and CPU
  • Custom Docker image (Y/N): N

Additional context
The SageMaker endpoint had been working for a while and successfully processing requests. The endpoint then restarted and installed the latest SageMaker SDK (2.212), after which it stopped processing requests and printed the logs above. I noticed that ModelBuilder creates a model package containing a requirements.txt, and that file lists sagemaker>=2.199. I modified it to sagemaker==2.199, which solved the issue.
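
The workaround described above can be sketched as a small script that rewrites the open-ended specifier to an exact pin. This is only a sketch; the path to the generated requirements.txt inside the model package is an assumption, not something the SDK documents.

```python
import re
from pathlib import Path


def pin_sagemaker(requirements_path, version="2.199"):
    """Replace an open-ended sagemaker specifier (>=, ~=, ==) with an exact pin.

    Workaround sketch for the requirements.txt that ModelBuilder places in the
    model package; all other requirement lines are left untouched.
    """
    path = Path(requirements_path)
    lines = path.read_text().splitlines()
    pinned = [
        f"sagemaker=={version}" if re.match(r"^sagemaker\s*[><=~!]", line) else line
        for line in lines
    ]
    path.write_text("\n".join(pinned) + "\n")


# Usage (hypothetical location inside the extracted model package):
# pin_sagemaker("model/code/requirements.txt")
```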
