Ability to ignore specific files/folders in ModelTrainer's script mode #5091

@discort

Describe the feature you'd like
I do not want .git, .env, .vscode, data, __pycache__ or any other irrelevant files/folders to be uploaded to the S3 artifacts when I use ModelTrainer's script mode via SourceCode. Moreover, copying source_dir may be time-consuming when it contains a large number of files, as .git and/or .env often do.

How would this feature be used? Please describe.
During development and sanity-checking on my local machine, the project accumulates unnecessary files/folders. The idea is to not upload them when using script mode.
Let's say I have the following project structure:

tree
.
├── README.md
├── .git
├── pipeline1
│       ├── train.py
│       ├── data
│       ├── .env
│       ├── __pycache__
│       │       ├── __init__.cpython-310.pyc
├── pipeline2
│       ├── train.py
│       ├── data
│       ├── __pycache__
│       │       ├── __init__.cpython-310.pyc

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute

image = "<image>"

source_code = SourceCode(
    source_dir="pipeline1",
    command="python train.py"
)

# or
# source_code = SourceCode(
#     source_dir=".",
#     command="python -m pipeline1.train",
# )

compute = Compute(
    instance_count=1,
    instance_type="ml.g5.8xlarge",
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)
model_trainer.train()

Expected result:
s3://<default_bucket_path>/<base_job_name>/input/code/ contains the source code without the ignored files/folders.

Describe alternatives you've considered

  1. Create a temporary directory, copy the script into it without the unnecessary files, and pass it as source_dir.
  2. Re-design the project structure so that unwanted files/dirs live outside the directory you want to upload.
  3. Always use BYOC (bring your own container) instead of script mode. This isn't practical in some cases, e.g. a quick sanity check.
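
For reference, alternative 1 could look roughly like the sketch below. It uses only the Python standard library (shutil.copytree with shutil.ignore_patterns); the helper name stage_source_dir, the IGNORE_PATTERNS tuple and the "code" subfolder name are my own illustrative choices, not part of the SageMaker SDK.

```python
import shutil
import tempfile
from pathlib import Path

# Names taken from the list in this issue; adjust to your project.
# Each pattern is matched against file/folder names at every depth.
IGNORE_PATTERNS = (".git", ".env", ".vscode", "data", "__pycache__")


def stage_source_dir(source_dir, patterns=IGNORE_PATTERNS):
    """Copy source_dir into a fresh temporary directory, skipping ignored names.

    Hypothetical helper (not an SDK function). Returns the path of the
    filtered copy; the caller should delete its parent directory
    (e.g. with shutil.rmtree) once the training job has been submitted.
    """
    staging_root = Path(tempfile.mkdtemp(prefix="sm-source-"))
    target = staging_root / "code"
    shutil.copytree(source_dir, target, ignore=shutil.ignore_patterns(*patterns))
    return str(target)
```

The returned path would then be passed as SourceCode(source_dir=stage_source_dir("pipeline1"), command="python train.py"). It works, but every project has to carry this boilerplate and pay for an extra local copy, which is why a built-in ignore option would be preferable.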
