Model.compile ignores framework_version when compiling for ml_inf1 #3209

Open
@vprecup

Description

Describe the bug
When compiling a model that was trained / fine-tuned with PyTorch 1.9.1 for ml_inf1 via the SDK, the framework_version argument is ignored, resulting in a version mismatch between the version used at training time and the one SageMaker chooses automatically.

To reproduce

import json
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor

sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.9",
    role="<role_arn>",
    entry_point="inference.py",
    source_dir="code",
    py_version="py3",
    name="<name>"
) 

compiled_inf_model = sm_model.compile(
    target_instance_family="ml_inf1",
    input_shape=<input_shape>,
    job_name="<job_name>",
    role="<role_arn>",
    framework="pytorch",
    framework_version="1.9",
    output_path="<output_path>",
    compiler_options=json.dumps("--dtype int64"),
    compile_max_run=1000,
)

Expected behavior
The Compilation Job should also show the "Framework version" when opened in the AWS Console. However, only the PYTORCH framework value is present, and the compilation fails after 5 minutes with the error message:

ClientError: InputConfiguration: Unable to load PyTorch model:
Unknown type name 'NoneType':
Serialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7
  _is_full_backward_hook : Optional[bool]
  def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,
    argument_1: Tensor) -> NoneType:
                           ~~~~~~~~ <--- HERE
    return None
For further troubleshooting common failures please visit: https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html

If, however, I clone the failed job in the AWS Console and just add the 1.9 "Framework version" manually, the job runs to completion.
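Until this is fixed in the SDK, an equivalent workaround is to call the CreateCompilationJob API directly with FrameworkVersion set in InputConfig, which is what the Console clone does. A minimal sketch (the job name, S3 URIs, and input shape below are placeholders, not values from this report):

```python
# Request body for the SageMaker CreateCompilationJob API, with the
# FrameworkVersion field set explicitly (the field Model.compile drops).
request = {
    "CompilationJobName": "my-inf1-job",                      # placeholder
    "RoleArn": "<role_arn>",
    "InputConfig": {
        "S3Uri": "s3://my-bucket/traced_model.tar.gz",        # placeholder
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',    # placeholder shape
        "Framework": "PYTORCH",
        "FrameworkVersion": "1.9",  # set manually, matching the training version
    },
    "OutputConfig": {
        "S3OutputLocation": "s3://my-bucket/compiled/",       # placeholder
        "TargetDevice": "ml_inf1",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 1000},
}

# import boto3
# boto3.client("sagemaker").create_compilation_job(**request)
```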

Screenshots or logs

localhost compiler-container-Primary[5078]: Traceback (most recent call last):
--
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 107, in compile_model
localhost compiler-container-Primary[5078]:     model = torch.jit.load(self.model_file)
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
localhost compiler-container-Primary[5078]:     script_module = jit_load(*args, **kwargs)
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/torch/jit/_serialization.py", line 161, in load
localhost compiler-container-Primary[5078]:     cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
localhost compiler-container-Primary[5078]: RuntimeError:
localhost compiler-container-Primary[5078]: Unknown type name 'NoneType':
localhost compiler-container-Primary[5078]: Serialized   File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7
localhost compiler-container-Primary[5078]:   _is_full_backward_hook : Optional[bool]
localhost compiler-container-Primary[5078]:   def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,
localhost compiler-container-Primary[5078]:     argument_1: Tensor) -> NoneType:
localhost compiler-container-Primary[5078]:                            ~~~~~~~~ <--- HERE
localhost compiler-container-Primary[5078]:     return None
localhost compiler-container-Primary[5078]: During handling of the above exception, another exception occurred:
localhost compiler-container-Primary[5078]: Traceback (most recent call last):
localhost compiler-container-Primary[5078]:   File "/opt/amazon/bin/neo_main.py", line 101, in <module>
localhost compiler-container-Primary[5078]:     compile()
localhost compiler-container-Primary[5078]:   File "/opt/amazon/bin/neo_main.py", line 74, in compile
localhost compiler-container-Primary[5078]:     compiler_options
localhost compiler-container-Primary[5078]:   File "/opt/amazon/bin/neo_main.py", line 32, in compile_model
localhost compiler-container-Primary[5078]:     return framework_instance.compile_model()
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 109, in compile_model
localhost compiler-container-Primary[5078]:     raise RuntimeError("InputConfiguration: Unable to load PyTorch model:", str(e))
localhost compiler-container-Primary[5078]: RuntimeError: ('InputConfiguration: Unable to load PyTorch model:', ' Unknown type name \'NoneType\': Serialized   File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7   _is_full_backward_hook : Optional[bool]   def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,     argument_1: Tensor) -> NoneType:                            ~~~~~~~~ <--- HERE     return None ')

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.97.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.9.1
  • Python version: 3.8
  • CPU or GPU: CPU/Inf
  • Custom Docker image (Y/N): -

Additional context
The problem may lie in the negative lookahead regex group (?!ml_inf) at this line: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L735.
Is this condition still applicable?
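For context, a zero-width negative lookahead matches only when the subject does not start with the given prefix, so any guard built on (?!ml_inf) would skip ml_inf targets entirely. A standalone demonstration of the regex behavior (not the SDK's actual code path):

```python
import re

# Zero-width negative lookahead: matches (an empty string) only when the
# subject does NOT begin with "ml_inf".
pattern = re.compile(r"(?!ml_inf)")

print(pattern.match("ml_c5"))    # match object -> a FrameworkVersion guard would pass
print(pattern.match("ml_inf1"))  # None -> the guard would silently drop FrameworkVersion
```

This is consistent with the observed behavior: the version is attached for other instance families but never for ml_inf1.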
