This example walks through the steps required to interactively fine-tune foundation models on Amazon SageMaker AI by using the @remote decorator to execute Training jobs.
You can run this repository from Amazon SageMaker Studio or from your local IDE.
For additional information, see the AWS blog post *Fine-tune Falcon 7B and other LLMs on Amazon SageMaker with @remote decorator*.
💡 Important: These notebook examples showcase the interactive experience of SageMaker AI capabilities and the @remote decorator for fine-tuning Small and Large Language Models with different distribution techniques, such as FSDP and DDP. After interactive experimentation and testing, the recommended path to production is through SageMaker ModelTrainer.
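A minimal sketch of the @remote pattern follows. The instance type and function body are illustrative assumptions, not taken from the notebooks, and the fallback decorator exists only so the sketch also runs where the SageMaker SDK is not installed:

```python
# Sketch: run a function as a SageMaker Training job via @remote.
# Requires the sagemaker SDK and AWS credentials to actually launch a job.
try:
    from sagemaker.remote_function import remote
except ImportError:
    # Fallback so this sketch still runs without the SDK: a no-op decorator.
    def remote(**kwargs):
        def wrap(fn):
            return fn
        return wrap

@remote(instance_type="ml.g5.2xlarge")  # illustrative instance type
def train(epochs: int = 1) -> str:
    # The real notebooks load the base model and run the LoRA/QLoRA
    # fine-tuning loop here; this stub just reports what it would do.
    return f"trained for {epochs} epoch(s)"

# Calling train(...) submits a Training job when the SDK is configured;
# with the fallback decorator it simply runs locally.
```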
The notebooks currently use the latest PyTorch Training Container available for the us-east-1 region. If you are running the notebooks in a different region, make sure to update ImageUri in the file config.yaml:
Python version used in the training container: Python 3.11
- Navigate to the list of [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)
- Select the right PyTorch training container based on your selected region
- Update ImageUri in the file config.yaml
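For reference, a minimal config.yaml for @remote defaults might look like the following; the ImageUri, instance type, and dependencies path are illustrative placeholders, not values from this repository:

```yaml
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        # Illustrative us-east-1 PyTorch training image; replace with the
        # URI for your region from the Available Deep Learning Containers list
        ImageUri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
        InstanceType: ml.g5.2xlarge
        Dependencies: ./requirements.txt
```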
- [Supervised - QLoRA] Falcon-7B
- [Supervised - QLoRA, FSDP] Llama-13B
- [Self-supervised - QLoRA, FSDP] Llama-13B
- [Self-supervised - QLoRA] Mistral-7B
- [Supervised - QLoRA, FSDP] Mixtral-8x7B
- [Supervised - QLoRA, DDP] Code-Llama 13B
- [Supervised - QLoRA, DDP] Llama-3 8B
- [Supervised - QLoRA, DDP] Llama-3.1 8B
- [Supervised - QLoRA, DDP] Arcee AI Llama-3.1 Supernova Lite
- [Supervised - QLoRA] Llama-3.2 1B
- [Supervised - QLoRA] Llama-3.2 3B
- [Supervised - QLoRA, FSDP] Codestral-22B
- [Supervised - LoRA] TinyLlama 1.1B
- [Supervised - FSDP, QLoRA] Arcee Lite 1.5B
- [Supervised - LoRA] SmolLM2-1.7B-Instruct
- [Supervised - QLoRA, FSDP] Qwen 2.5 7B
- [Supervised - QLoRA] Falcon3 3B
- [Supervised - QLoRA, FSDP] Falcon3 7B
- [Supervised - QLoRA, FSDP] Llama-3.1 70B
- [Self-supervised - DoRA, FSDP] Mistral-7B v0.3
- [Supervised - QLoRA, FSDP] Llama-3.3 70B
- [Supervised - QLoRA, FSDP] OpenCoder-8B-Instruct
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Qwen-32B
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Llama-70B
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Llama-8B
- [Supervised - QLoRA, DDP] DeepSeek-R1-Distill-Qwen-1.5B
- [Supervised - QLoRA, FSDP] DeepSeek-R1-Distill-Qwen-7B
- [Supervised - QLoRA, FSDP] Qwen3-32B
- [Supervised - QLoRA, FSDP] Qwen3-8B
```
Traceback (most recent call last):
  in deserialize
    return cloudpickle.loads(bytes_to_deserialize)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute '_function_setstate' on <module 'cloudpickle.cloudpickle' from '/opt/conda/lib/python3.11/site-packages/cloudpickle/cloudpickle.py'>
```
Align your local cloudpickle version with the one in the container by including it in your requirements.txt:

```
cloudpickle==x.x.x
```

where x.x.x is the version you want to install.
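One way to produce that exact pin is to read the installed version from the container's Python and emit the requirements.txt line. pinned_line is a hypothetical helper, not part of any SDK; importlib.metadata is standard library:

```python
# Sketch: generate a "package==x.x.x" requirements.txt pin from the version
# installed in the current interpreter. Run inside the training container
# (e.g. in a notebook cell) and copy the output into your local requirements.txt.
from importlib.metadata import PackageNotFoundError, version

def pinned_line(package: str) -> str:
    """Return 'package==x.x.x' for the installed version of `package`."""
    try:
        return f"{package}=={version(package)}"
    except PackageNotFoundError:
        return f"# {package} is not installed in this environment"

print(pinned_line("cloudpickle"))
```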
```
Error when deserializing bytes downloaded from s3:////exception/payload.pkl:
TypeError('unpickle_exception() takes 4 positional arguments but 7 were given').
NOTE: this may be caused by inconsistent sagemaker python sdk versions where remote function runs versus the one used on client side. If the sagemaker versions do not match, a warning message would be logged starting with 'Inconsistent sagemaker versions found'. Please check it to validate.
```
This behavior may happen when you execute the @remote training job from a local CPU environment and run the training workload on GPU instances. The reported exception is not related to the SageMaker Python SDK, and the error message shown above is not the actual exception. To investigate the real issue:
- Check the CloudWatch logs generated by the training job
- Look for detailed error messages in the logs that will provide more insight into what's actually failing
Note that environment mismatches between local and remote execution environments can sometimes cause serialization/deserialization issues that manifest with this error.
```
ModuleNotFoundError("No module named 'torch.accelerator'"). NOTE: this may be caused by inconsistent sagemaker python sdk versions where remote function runs versus the one used on client side. If the sagemaker versions do not match, a warning message would be logged starting with 'Inconsistent sagemaker versions found'. Please check it to validate.
```
This error happens when your local torch version is not aligned with the version used in the SageMaker Training container. Make sure your development environment has the same torch version installed as the image specified in the file config.yaml.
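To catch this before submitting a job, a small pre-flight check can compare the local torch version with the one expected in the container. The expected version string below is a hypothetical placeholder; read the real one from the container's pip list:

```python
# Sketch: fail fast if the local torch version differs from the training
# container's. EXPECTED_CONTAINER_TORCH is a placeholder -- check your image.
from importlib.metadata import PackageNotFoundError, version

EXPECTED_CONTAINER_TORCH = "2.3.0"  # hypothetical; not from this repository

def check_torch_matches(expected: str) -> bool:
    """True if the locally installed torch matches the expected version."""
    try:
        local = version("torch")
    except PackageNotFoundError:
        return False  # torch not installed locally
    # Strip local build tags such as "+cu121" before comparing.
    return local.split("+")[0] == expected

if not check_torch_matches(EXPECTED_CONTAINER_TORCH):
    print(f"Local torch does not match the container; "
          f"install torch=={EXPECTED_CONTAINER_TORCH} before running @remote.")
```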