Description
Describe the bug
When compiling a model for ml_inf1 via the SDK, where the model was trained / fine-tuned with PyTorch 1.9.1, the framework_version
argument is ignored, resulting in a version mismatch between the version used at training time and the one chosen automatically by SageMaker.
To reproduce
import json
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor

sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.9",
    role="<role_arn>",
    entry_point="inference.py",
    source_dir="code",
    py_version="py3",
    name="<name>",
)

compiled_inf_model = sm_model.compile(
    target_instance_family="ml_inf1",
    input_shape=<input_shape>,
    job_name="<job_name>",
    role="<role_arn>",
    framework="pytorch",
    framework_version="1.9",
    output_path="<output_path>",
    compiler_options=json.dumps("--dtype int64"),
    compile_max_run=1000,
)
Expected behavior
The compilation job should also show the "Framework version" when opened in the AWS Console. However, only the PYTORCH
framework value is present, and the compilation fails after 5 minutes with the error message:
ClientError: InputConfiguration: Unable to load PyTorch model:', '\nUnknown type name \'NoneType\':\nSerialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7\n _is_full_backward_hook : Optional[bool]\n def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,\n argument_1: Tensor) -> NoneType:\n ~~~~~~~~ <--- HERE\n return None\n') For further troubleshooting common failures please visit: https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html
If, however, I clone the failed job in the AWS Console and just add the 1.9 "Framework version" manually, the job runs to completion.
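Until the SDK passes the version through, the job can be created directly with boto3's create_compilation_job, which does accept FrameworkVersion in InputConfig. A minimal sketch; the job name, input shape, and all S3 / ARN values are placeholders, not taken from the failing job:

```python
import json

def build_compilation_request(model_s3_uri, output_s3_uri, role_arn):
    """Build a create_compilation_job request that sets FrameworkVersion
    explicitly. Pass the result to boto3.client("sagemaker")
    .create_compilation_job(**request) to run the actual job."""
    return {
        "CompilationJobName": "pytorch-inf1-with-framework-version",  # placeholder
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            "DataInputConfig": json.dumps({"input0": [1, 3, 224, 224]}),  # example shape
            "Framework": "PYTORCH",
            # The field the SDK currently drops for ml_inf targets:
            "FrameworkVersion": "1.9",
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": "ml_inf1",
            "CompilerOptions": json.dumps("--dtype int64"),
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 1000},
    }
```

This mirrors what cloning the job in the Console and adding "Framework version" by hand ends up submitting.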
Screenshots or logs
localhost compiler-container-Primary[5078]: Traceback (most recent call last):
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 107, in compile_model
localhost compiler-container-Primary[5078]: model = torch.jit.load(self.model_file)
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
localhost compiler-container-Primary[5078]: script_module = jit_load(*args, **kwargs)
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/torch/jit/_serialization.py", line 161, in load
localhost compiler-container-Primary[5078]: cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
localhost compiler-container-Primary[5078]: RuntimeError:
localhost compiler-container-Primary[5078]: Unknown type name 'NoneType':
localhost compiler-container-Primary[5078]: Serialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7
localhost compiler-container-Primary[5078]: _is_full_backward_hook : Optional[bool]
localhost compiler-container-Primary[5078]: def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,
localhost compiler-container-Primary[5078]: argument_1: Tensor) -> NoneType:
localhost compiler-container-Primary[5078]: ~~~~~~~~ <--- HERE
localhost compiler-container-Primary[5078]: return None
localhost compiler-container-Primary[5078]: During handling of the above exception, another exception occurred:
localhost compiler-container-Primary[5078]: Traceback (most recent call last):
localhost compiler-container-Primary[5078]: File "/opt/amazon/bin/neo_main.py", line 101, in <module>
localhost compiler-container-Primary[5078]: compile()
localhost compiler-container-Primary[5078]: File "/opt/amazon/bin/neo_main.py", line 74, in compile
localhost compiler-container-Primary[5078]: compiler_options
localhost compiler-container-Primary[5078]: File "/opt/amazon/bin/neo_main.py", line 32, in compile_model
localhost compiler-container-Primary[5078]: return framework_instance.compile_model()
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 109, in compile_model
localhost compiler-container-Primary[5078]: raise RuntimeError("InputConfiguration: Unable to load PyTorch model:", str(e))
localhost compiler-container-Primary[5078]: RuntimeError: ('InputConfiguration: Unable to load PyTorch model:', ' Unknown type name \'NoneType\': Serialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7 _is_full_backward_hook : Optional[bool] def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh, argument_1: Tensor) -> NoneType: ~~~~~~~~ <--- HERE return None ')
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.97.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 1.9.1
- Python version: 3.8
- CPU or GPU: CPU/Inf
- Custom Docker image (Y/N): -
Additional context
The problem may lie in the negative lookahead regex group (?!ml_inf)
at this line: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L735.
Is this condition still applicable?
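A quick check of how that lookahead behaves supports the suspicion (a sketch of the suspected guard; the exact condition in model.py may differ):

```python
import re

# A negative lookahead anchored at the start of the target name matches
# (zero-width) for every target EXCEPT those beginning with "ml_inf" --
# so any framework_version handling gated on it is silently skipped for
# Inferentia targets.
NON_INF_TARGET = re.compile(r"(?!ml_inf)")

print(bool(NON_INF_TARGET.match("ml_c5")))    # True: guard passes
print(bool(NON_INF_TARGET.match("ml_inf1")))  # False: guard fails, version dropped
```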