Description
Describe the feature you'd like
I do not want .git, .env, .vscode, data, __pycache__, or any other irrelevant files/folders to be uploaded to the S3 artifacts when I use ModelTrainer in script mode via SourceCode. Moreover, copying source_dir may be time-consuming due to the large number of files in directories such as .git and/or .env.
How would this feature be used? Please describe.
During development/sanity-checking on a local machine I have some unnecessary files/folders. The idea is to not upload them when using script mode.
Let's say I have project structure:
tree
.
├── README.md
├── .git
├── pipeline1
│   ├── train.py
│   ├── data
│   ├── .env
│   └── __pycache__
│       └── __init__.cpython-310.pyc
└── pipeline2
    ├── train.py
    ├── data
    └── __pycache__
        └── __init__.cpython-310.pyc
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute
image = "<image>"
source_code = SourceCode(
    source_dir="pipeline1",
    command="python train.py"
)
# or
# source_code = SourceCode(
#     source_dir=".",
#     command="python -m pipeline1.train")
compute = Compute(
    instance_count=1,
    instance_type="ml.g5.8xlarge"
)
model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)
model_trainer.train()
Expected result: the code uploaded to
s3://<default_bucket_path>/<base_job_name>/input/code/
does not contain the ignored files/folders.
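To make the expected result concrete, here is a minimal sketch that lists which files would remain after filtering. It is not part of the SageMaker SDK: the function name files_to_upload and the IGNORE_PATTERNS list are hypothetical, with the patterns taken from the folders named in this issue.

```python
import fnmatch
import os

# Hypothetical patterns, taken from the files/folders named in this issue.
IGNORE_PATTERNS = [".git", ".env", ".vscode", "data", "__pycache__"]


def files_to_upload(source_dir, ignore_patterns=IGNORE_PATTERNS):
    """Return sorted relative paths under source_dir, skipping ignored names."""

    def ignored(name):
        return any(fnmatch.fnmatch(name, pattern) for pattern in ignore_patterns)

    kept = []
    for root, dirs, files in os.walk(source_dir):
        # Prune ignored directories in place so os.walk never descends into them.
        dirs[:] = [d for d in dirs if not ignored(d)]
        for name in files:
            if not ignored(name):
                rel = os.path.relpath(os.path.join(root, name), source_dir)
                kept.append(rel.replace(os.sep, "/"))
    return sorted(kept)
```

For the pipeline1 directory in the tree above, I would expect only train.py to be listed.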
Describe alternatives you've considered
- Create a temporary directory, copy the script without the unnecessary files, and pass it as source_dir.
- Redesign the project structure so that unwanted files/dirs live outside the directory you want to upload.
- Always use BYOC instead of script mode (not practical in some cases, e.g. sanity checks).
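For reference, the temporary-directory workaround from the first alternative could look roughly like this. It is a sketch using only the standard library (shutil.copytree with shutil.ignore_patterns); the helper name copy_without_ignored and the default patterns are my own and not part of the SDK.

```python
import os
import shutil
import tempfile


def copy_without_ignored(
    source_dir,
    patterns=(".git", ".env", ".vscode", "data", "__pycache__"),
):
    """Copy source_dir into a fresh temp directory, skipping ignored names.

    Returns the path of the filtered copy. The caller is responsible for
    deleting the temp directory afterwards.
    """
    staging_root = tempfile.mkdtemp(prefix="sm_source_")
    # Keep the original directory name inside the staging root.
    staged = os.path.join(staging_root, os.path.basename(os.path.abspath(source_dir)))
    shutil.copytree(source_dir, staged, ignore=shutil.ignore_patterns(*patterns))
    return staged


# Intended usage with the snippet from this issue (not executed here):
# staged = copy_without_ignored("pipeline1")
# source_code = SourceCode(source_dir=staged, command="python train.py")
```

This works, but it is boilerplate every user has to write and clean up themselves, which is why built-in support would be preferable.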