
aws-samples/amazon-sagemaker-llm-fine-tuning-remote-decorator

Interactive fine-tuning of Foundation Models with Amazon SageMaker Training using @remote decorator

In this example we go through the steps required for interactively fine-tuning foundation models on Amazon SageMaker AI, using the @remote decorator to execute training jobs.

You can run this repository from Amazon SageMaker Studio or from your local IDE.

For additional information, take a look at the AWS blog post Fine-tune Falcon 7B and other LLMs on Amazon SageMaker with @remote decorator.

💡 Important: The scope of these notebook examples is to showcase the interactive experience of SageMaker AI and the @remote decorator for fine-tuning small and large language models with different distribution techniques, such as FSDP and DDP. After your interactive experimentation and testing, the recommended path to production is through SageMaker ModelTrainer.
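The pattern the notebooks rely on can be sketched as follows. This is a minimal sketch, not the repository's exact code: the instance type, keep-alive setting, and function body are illustrative, and the ImportError fallback exists only so the snippet stays runnable on a machine without the SageMaker SDK installed.

```python
try:
    from sagemaker.remote_function import remote
except ImportError:
    # Fallback no-op decorator so this sketch runs without the SageMaker SDK;
    # with the SDK installed, the job settings below take effect instead.
    def remote(**_job_kwargs):
        def _wrap(func):
            return func
        return _wrap

# With the SDK installed, calling fine_tune(...) launches a SageMaker
# Training job that executes this function body on the chosen instance
# and returns the result to the caller. Settings not passed here (for
# example, ImageUri) are read from config.yaml.
@remote(instance_type="ml.g5.12xlarge", keep_alive_period_in_seconds=300)
def fine_tune(model_id: str, epochs: int = 1) -> str:
    # The real notebooks load the base model and run the fine-tuning loop here.
    return f"fine-tuned {model_id} for {epochs} epoch(s)"
```

This is what makes the experience "interactive": the decorated function is authored and invoked from a notebook or local IDE, while the heavy lifting runs remotely on the training instance.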


Prerequisites

The notebooks are currently using the latest PyTorch Training Container available for the region us-east-1. If you are running the notebooks in a different region, make sure to update the ImageUri in the file config.yaml.

⚠️ Make sure the Python version in your local environment matches the Python version used in the training container: Python 3.11.

If you want to operate in a different AWS region:

  1. Navigate to [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)
  2. Select the appropriate training container image for your selected region
  3. Update ImageUri in the file config.yaml
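The config.yaml read by the @remote decorator follows the SageMaker Python SDK configuration schema. The snippet below is a sketch with placeholder values: the `<account>`, `<region>`, and `<tag>` parts must be filled in from the Available Deep Learning Containers list, and the instance type is illustrative.

```yaml
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        # Region-specific training image; replace with the URI for your region
        ImageUri: <account>.dkr.ecr.<region>.amazonaws.com/pytorch-training:<tag>
        InstanceType: ml.g5.12xlarge
        Dependencies: ./requirements.txt
```

Keeping these settings in config.yaml rather than in the decorator call means the notebooks stay portable across regions: only this file needs to change.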

Notebooks

  1. [Supervised - QLoRA] Falcon-7B
  2. [Supervised - QLoRA, FSDP] Llama-13B
  3. [Self-supervised - QLoRA, FSDP] Llama-13B
  4. [Self-supervised - QLoRA] Mistral-7B
  5. [Supervised - QLoRA, FSDP] Mixtral-8x7B
  6. [Supervised - QLoRA, DDP] Code-Llama 13B
  7. [Supervised - QLoRA, DDP] Llama-3 8B
  8. [Supervised - QLoRA, DDP] Llama-3.1 8B
  9. [Supervised - QLoRA, DDP] Arcee AI Llama-3.1 Supernova Lite
  10. [Supervised - QLoRA] Llama-3.2 1B
  11. [Supervised - QLoRA] Llama-3.2 3B
  12. [Supervised - QLoRA, FSDP] Codestral-22B
  13. [Supervised - LoRA] TinyLlama 1.1B
  14. [Supervised - QLoRA, FSDP] Arcee Lite 1.5B
  15. [Supervised - LoRA] SmolLM2-1.7B-Instruct
  16. [Supervised - QLoRA, FSDP] Qwen 2.5 7B
  17. [Supervised - QLoRA] Falcon3 3B
  18. [Supervised - QLoRA, FSDP] Falcon3 7B
  19. [Supervised - QLoRA, FSDP] Llama-3.1 70B
  20. [Self-supervised - DoRA, FSDP] Mistral-7B v0.3
  21. [Supervised - QLoRA, FSDP] Llama-3.3 70B
  22. [Supervised - QLoRA, FSDP] OpenCoder-8B-Instruct
  23. [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Qwen-32B
  24. [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Llama-70B
  25. [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Llama-8B
  26. [Supervised - QLoRA, DDP] DeepSeek-R1-Distill-Qwen-1.5B
  27. [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Qwen-7B
  28. [Supervised - QLoRA, FSDP] Qwen3-32B
  29. [Supervised - QLoRA, FSDP] Qwen3-8B

Troubleshoot

Issue 1 - Error cloudpickle._function_setstate

```
Traceback (most recent call last):
  ... in deserialize
    return cloudpickle.loads(bytes_to_deserialize)
AttributeError: Can't get attribute '_function_setstate' on <module 'cloudpickle.cloudpickle' from '/opt/conda/lib/python3.11/site-packages/cloudpickle/cloudpickle.py'>
```

Solution

Align your local cloudpickle version with the one used in the container by adding it to your requirements.txt:

```
cloudpickle==x.x.x
```

Where x.x.x is the version used in the container.
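To build the pin line, you can read the installed version from package metadata. The helper below is an illustrative sketch (the function name `pin` is not part of any SDK); run it in the environment whose version you want to replicate.

```python
from importlib.metadata import version, PackageNotFoundError

def pin(package: str) -> str:
    """Return an exact '==' pin line for requirements.txt, or a comment
    if the package is not installed in the current environment."""
    try:
        return f"{package}=={version(package)}"
    except PackageNotFoundError:
        return f"# {package} is not installed locally"

print(pin("cloudpickle"))
```

Running the same helper inside the training container (for example, from a short test job) gives you the container-side version to pin locally.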


Issue 2 - Error: TypeError when deserializing bytes from S3

```
Error when deserializing bytes downloaded from s3:////exception/payload.pkl:
TypeError('unpickle_exception() takes 4 positional arguments but 7 were given').
NOTE: this may be caused by inconsistent sagemaker python sdk versions where remote function runs versus the one used on client side.
If the sagemaker versions do not match, a warning message would be logged starting with 'Inconsistent sagemaker versions found'.
Please check it to validate.
```

Solution

This behavior may happen when you execute the @remote training job from a local CPU-only environment while the training workload runs on GPU instances. The reported exception is not related to the SageMaker Python SDK.

The error message shown above is not the actual exception. To investigate the real issue:

  1. Check the CloudWatch logs generated by the training job
  2. Look for detailed error messages in the logs that will provide more insight into what's actually failing

Note that environment mismatches between local and remote execution environments can sometimes cause serialization/deserialization issues that manifest with this error.

Issue 3 - Error: ModuleNotFoundError: No module named 'torch.accelerator'

```
ModuleNotFoundError("No module named 'torch.accelerator'"). NOTE: this may be caused by inconsistent sagemaker python sdk versions where remote function runs versus the one used on client side. If the sagemaker versions do not match, a warning message would be logged starting with 'Inconsistent sagemaker versions found'. Please check it to validate.
```

Solution

This error happens when your local torch version is not aligned with the version used in the SageMaker Training container. Make sure your development environment has the same torch version as the image specified in the config.yaml file.
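A quick preflight check before launching a @remote job can catch this class of mismatch early. The helper below is an illustrative sketch (the function name and the example version string are assumptions, not part of any SDK):

```python
from importlib.metadata import version, PackageNotFoundError

def matches_container(package: str, container_version: str) -> bool:
    """Return True if the locally installed package version starts with
    the version used in the training container, False otherwise."""
    try:
        return version(package).startswith(container_version)
    except PackageNotFoundError:
        return False

# Example usage: confirm local torch matches the container's torch
# before launching a job ("2.4" is a placeholder version prefix).
# matches_container("torch", "2.4")
```

Comparing only a version prefix (major.minor) is usually enough here, since patch-level differences rarely change module layout the way major/minor upgrades do.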
