Description
The PipelineData class has a get_env_variable_name() method which returns the name of the environment variable for the dataset (e.g. "$AZUREML_DATAREFERENCE_my_pipelinedata"). This is also the __str__ implementation for PipelineData, so it can easily be used in string formatting to pass it as an argument to a pipeline step, even with a custom argument format (such as the one used by hydra.cc, as also mentioned in https://github.com/MicrosoftDocs/azure-docs/issues/66599):
```python
my_pipelinedata = PipelineData("my_pipelinedata", datastore=datastore, is_directory=True)
train_step = PythonScriptStep(
    script_name="train.py",
    arguments=[
        f"dataset.path={my_pipelinedata}"
    ],
    # ...
)
```

Unfortunately, this is not the case if you want to consume a Dataset. The DatasetConsumptionConfig class does not provide a get_env_variable_name() method, and it doesn't have a custom __str__() implementation either. So, if you want to use it in string formatting for arguments, you have to manually construct the name of the environment variable, which is just a bit more code, but inconsistent with how it is done for PipelineData:
```python
def as_env_variable(dataset):
    return f"${dataset.name}"

my_dataset = (
    Dataset.get_by_name(workspace, name="my_dataset")
    .as_named_input("my_dataset")
    .as_mount()
)
train_step = PythonScriptStep(
    script_name="train.py",
    arguments=[
        f"dataset1.path={as_env_variable(my_dataset)}",
        f"dataset2.path={my_pipelinedata}"
    ],
    # ...
)
```

So adding a __str__() implementation to the DatasetConsumptionConfig class, or at least a get_env_variable_name() method, would make such code more consistent.