This example walks through the steps required to interactively fine-tune foundation models on Amazon SageMaker AI by using the @remote decorator to execute Training jobs.
You can run this repository from Amazon SageMaker Studio or from your local IDE.
For additional information, see the AWS blog post *Fine-tune Falcon 7B and other LLMs on Amazon SageMaker with @remote decorator*.
💡 Important: These notebook examples showcase the interactive experience of SageMaker AI capabilities and the @remote decorator for fine-tuning Small and Large Language Models with different distribution techniques, such as FSDP and DDP. After interactive experimentation and testing, the recommended path to production is through SageMaker ModelTrainer.
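A minimal sketch of the @remote pattern follows. The instance type and function body are illustrative assumptions, not taken from the notebooks, and the fallback decorator exists only so the sketch also runs where the SageMaker SDK is not installed:

```python
# Sketch: run a function as a SageMaker Training job via @remote.
# Requires the sagemaker SDK and AWS credentials to actually launch a job.
try:
    from sagemaker.remote_function import remote
except ImportError:
    # Fallback so this sketch still runs without the SDK: a no-op decorator.
    def remote(**kwargs):
        def wrap(fn):
            return fn
        return wrap

@remote(instance_type="ml.g5.2xlarge")  # illustrative instance type
def train(epochs: int = 1) -> str:
    # The real notebooks load the base model and run the LoRA/QLoRA
    # fine-tuning loop here; this stub just reports what it would do.
    return f"trained for {epochs} epoch(s)"

# Calling train(...) submits a Training job when the SDK is configured;
# with the fallback decorator it simply runs locally.
```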
The notebooks currently use the latest PyTorch Training Container available for the us-east-1 region. If you are running the notebooks in a different region, make sure to update ImageUri in the file config.yaml:
Python version used in the training container: Python 3.11
- Navigate to the list of [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)
- Select the right PyTorch training container based on your selected region
- Update ImageUri in the file config.yaml
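For reference, a minimal config.yaml for @remote defaults might look like the following; the ImageUri, instance type, and dependencies path are illustrative placeholders, not values from this repository:

```yaml
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        # Illustrative us-east-1 PyTorch training image; replace with the
        # URI for your region from the Available Deep Learning Containers list
        ImageUri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
        InstanceType: ml.g5.2xlarge
        Dependencies: ./requirements.txt
```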
- [Supervised - QLoRA] Falcon-7B
- [Supervised - QLoRA, FSDP] Llama-13B
- [Self-supervised - QLoRA, FSDP] Llama-13B
- [Self-supervised - QLoRA] Mistral-7B
- [Supervised - QLoRA, FSDP] Mixtral-8x7B
- [Supervised - QLoRA, DDP] Code-Llama 13B
- [Supervised - QLoRA, DDP] Llama-3 8B
- [Supervised - QLoRA, DDP] Llama-3.1 8B
- [Supervised - QLoRA, DDP] Arcee AI Llama-3.1 Supernova Lite
- [Supervised - QLoRA] Llama-3.2 1B
- [Supervised - QLoRA] Llama-3.2 3B
- [Supervised - QLoRA, FSDP] Codestral-22B
- [Supervised - LoRA] TinyLlama 1.1B
- [Supervised - FSDP, QLoRA] Arcee Lite 1.5B
- [Supervised - LoRA] SmolLM2-1.7B-Instruct
- [Supervised - QLoRA, FSDP] Qwen 2.5 7B
- [Supervised - QLoRA] Falcon3 3B
- [Supervised - QLoRA, FSDP] Falcon3 7B
- [Supervised - QLoRA, FSDP] Llama-3.1 70B
- [Self-supervised - DoRA, FSDP] Mistral-7B v0.3
- [Supervised - QLoRA, FSDP] Llama-3.3 70B
- [Supervised - QLoRA, FSDP] OpenCoder-8B-Instruct
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Qwen-32B
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Llama-70B
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Llama-8B
- [Supervised - QLoRA, DDP] DeepSeek-R1-Distill-Qwen-1.5B
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Qwen-7B
- [Supervised - QLoRA, FSDP] Qwen3-32B
- [Supervised - QLoRA, FSDP] Qwen3-8B
```
Traceback (most recent call last):
  in deserialize
    return cloudpickle.loads(bytes_to_deserialize)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute '_function_setstate' on <module 'cloudpickle.cloudpickle' from '/opt/conda/lib/python3.11/site-packages/cloudpickle/cloudpickle.py'>
```
Align your local cloudpickle version with the one in the container by including it in your requirements.txt:

```
cloudpickle==x.x.x
```

where x.x.x is the version you want to install.
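One way to produce that exact pin is to read the installed version from the container's Python and emit the requirements.txt line. pinned_line is a hypothetical helper, not part of any SDK; importlib.metadata is standard library:

```python
# Sketch: generate a "package==x.x.x" requirements.txt pin from the version
# installed in the current interpreter. Run inside the training container
# (e.g. in a notebook cell) and copy the output into your local requirements.txt.
from importlib.metadata import PackageNotFoundError, version

def pinned_line(package: str) -> str:
    """Return 'package==x.x.x' for the installed version of `package`."""
    try:
        return f"{package}=={version(package)}"
    except PackageNotFoundError:
        return f"# {package} is not installed in this environment"

print(pinned_line("cloudpickle"))
```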
```
Error when deserializing bytes downloaded from s3:////exception/payload.pkl:
TypeError('unpickle_exception() takes 4 positional arguments but 7 were given').
NOTE: this may be caused by inconsistent sagemaker python sdk versions where remote function runs versus the one used on client side. If the sagemaker versions do not match, a warning message would be logged starting with 'Inconsistent sagemaker versions found'. Please check it to validate.
```
This behavior may happen when you execute the @remote training job from a local CPU environment and run the training workload on GPU instances. The reported exception is not related to the SageMaker Python SDK, and the error message shown above is not the actual exception. To investigate the real issue:
- Check the CloudWatch logs generated by the training job
- Look for detailed error messages in the logs that will provide more insight into what's actually failing
Note that environment mismatches between local and remote execution environments can sometimes cause serialization/deserialization issues that manifest with this error.
```
ModuleNotFoundError("No module named 'torch.accelerator'"). NOTE: this may be caused by inconsistent sagemaker python sdk versions where remote function runs versus the one used on client side. If the sagemaker versions do not match, a warning message would be logged starting with 'Inconsistent sagemaker versions found'. Please check it to validate.
```
This error happens when your local torch version is not aligned with the version used in the SageMaker Training container. Make sure your development environment has the same torch version installed as the image specified in the file config.yaml.
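To catch this before submitting a job, a small pre-flight check can compare the local torch version with the one expected in the container. The expected version string below is a hypothetical placeholder; read the real one from the container's pip list:

```python
# Sketch: fail fast if the local torch version differs from the training
# container's. EXPECTED_CONTAINER_TORCH is a placeholder -- check your image.
from importlib.metadata import PackageNotFoundError, version

EXPECTED_CONTAINER_TORCH = "2.3.0"  # hypothetical; not from this repository

def check_torch_matches(expected: str) -> bool:
    """True if the locally installed torch matches the expected version."""
    try:
        local = version("torch")
    except PackageNotFoundError:
        return False  # torch not installed locally
    # Strip local build tags such as "+cu121" before comparing.
    return local.split("+")[0] == expected

if not check_torch_matches(EXPECTED_CONTAINER_TORCH):
    print(f"Local torch does not match the container; "
          f"install torch=={EXPECTED_CONTAINER_TORCH} before running @remote.")
```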