Add docs for lakeFS <> MLflow integration (#8599)

---------

Co-authored-by: Barak Amar <[email protected]>
talSofer and nopcoder authored Feb 6, 2025
1 parent 7017f27 commit 3f4af4f
Showing 4 changed files with 271 additions and 1 deletion.
Binary file added docs/assets/img/logos/MLflow-logo.png
Binary file added docs/assets/img/mlflow_inspect_experiment_run.png
2 changes: 1 addition & 1 deletion docs/integrations/index.md
@@ -49,7 +49,7 @@ See below for detailed instructions for using different technologies with lakeFS
<td width="25%" align=center><a href="./r.html"><img width=120 src="{{ site.baseurl }}/assets/img/logos/r.png" alt="r logo"/><br/>R</a></td>
<td width="25%" align=center><a href="./red_hat_openshift_ai.html"><img width=120 src="{{ site.baseurl }}/assets/img/logos/red_hat_openshift_ai.png" alt="Red Hat OpenShift AI Logo"/><br/>Red Hat OpenShift AI</a></td>
<td width="25%" align=center><a href="./vertex_ai.html"><img width=120 src="{{ site.baseurl }}/assets/img/logos/vertex_ai.png" alt="Vertex AI Logo"/><br/>Vertex AI</a></td>
- <td width="25%" align=center></td>
+ <td width="25%" align=center><a href="./mlflow.html"><img width=120 src="{{ site.baseurl }}/assets/img/logos/MLflow-logo.png" alt="MLflow Logo"/><br/>MLflow</a></td>
</tr>
</table>

270 changes: 270 additions & 0 deletions docs/integrations/mlflow.md
@@ -0,0 +1,270 @@
---
title: MLflow
description: How to use MLflow with lakeFS
parent: Integrations
---

# Using MLflow with lakeFS

[MLflow](https://mlflow.org/docs/latest/index.html) is a comprehensive tool designed to manage the machine learning lifecycle,
assisting practitioners and teams in handling the complexities of ML processes. It focuses on the full lifecycle of machine
learning projects, ensuring that each phase is manageable, traceable, and reproducible.

MLflow comprises multiple core components, and lakeFS seamlessly integrates with the [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html#tracking)
component. MLflow Tracking enables experiment tracking that accounts for both inputs and outputs, allowing for visualization
and comparison of experiment results.

{% include toc_2-3.html %}

## Benefits of integrating MLflow with lakeFS

Integrating MLflow with lakeFS offers several advantages that enhance the machine learning workflow:
1. **Experiment Reproducibility**: By leveraging MLflow's [input logging](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_input)
capabilities alongside lakeFS's data versioning, you can precisely track the specific dataset version used in each experiment
run. This ensures that experiments remain reproducible over time, even as datasets evolve.
2. **Parallel Experiments with Zero Data Copy**: lakeFS enables efficient [branching](../understand/model.md#branches) without
duplicating data. This allows for multiple experiments to be conducted in parallel, with each branch providing an isolated
environment for dataset modifications. Changes in one branch do not affect others, promoting safe collaboration among
team members. Once an experiment is complete, the branch can be seamlessly merged back into the main dataset, incorporating
new insights.

## How to use MLflow with lakeFS

To harness the combined capabilities of MLflow and lakeFS for safe experimentation and accurate result reproduction, consider
the workflow below and review the practical examples provided in the next section.

### Recommended workflow

1. **Create a branch for each experiment**: Start each experiment by creating a dedicated lakeFS branch for it. This approach
allows you to safely make changes to your input dataset without duplicating it. You will later load data from this branch
to your MLflow experiment runs.
2. **Read datasets from the experiment branch**: Conduct your experiments by reading data directly from the dedicated
branch. We recommend reading the dataset from the head commit of the branch to ensure precise version tracking.
3. **Create an MLflow Dataset pointing to lakeFS**: Use MLflow's [Dataset](https://mlflow.org/docs/latest/python_api/mlflow.data.html#mlflow.data.dataset.Dataset)
abstraction, ensuring that the [dataset source](https://mlflow.org/docs/latest/python_api/mlflow.data.html#mlflow.data.dataset_source.DatasetSource)
points to lakeFS.
4. **Log your input**: Use MLflow's [log_input](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_input)
function to log the versioned dataset stored in lakeFS.
5. **Commit dataset changes**: Machine learning development is inherently iterative. When you make changes to your input dataset,
commit them to the experiment branch in lakeFS with a meaningful commit message. During an experiment run, load the dataset
version corresponding to the branch's head commit and track this reference to facilitate future result reproduction.
6. **Merge experiment results**: After concluding your experimentation, merge the branch used for the selected experiment
run back into the main branch, as shown in the sketch after the note below.

{: .note}
**Note: Branch per experiment vs. branch per experiment run**<br>
Although lakeFS branches are quick and cost-effective to create, making a branch per experiment run feasible, it's
often more efficient to create a branch per experiment. By reading directly from the head
commit of the experiment branch, you can distinguish between dataset versions without creating excessive branches. This
practice maintains branch hygiene within lakeFS.
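
Steps 5 and 6 of the workflow are not shown in the examples below, so here is a minimal sketch of both using the same
high-level `lakefs` package; the commit message is illustrative, and the `commit`/`merge_into` calls are assumed from
the lakeFS Python SDK:

```python
import lakefs

repo = lakefs.Repository("my-repo")
exp_branch = repo.branch("experiment-1")

# Step 5: commit dataset changes made on the experiment branch,
# with a message that explains what changed
exp_branch.commit(message="Remove rows with missing birth dates")

# Step 6: after concluding the experiment, merge the branch back into main
exp_branch.merge_into(repo.branch("main"))
```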

### Example: Using Pandas

```python
import lakefs
import mlflow
import pandas as pd

repo = lakefs.Repository("my-repo")
repo_id = repo.id

exp_branch = repo.branch("experiment-1").create(source_reference="main", exist_ok=True)
branch_id = exp_branch.id
head_commit_id = exp_branch.head.id

table_path = "famous_people.csv"

dataset_source_url = f"s3://{repo_id}/{head_commit_id}/{table_path}"

# Read the dataset from lakeFS at the branch's head commit, so the exact version is pinned
raw_data = pd.read_csv(dataset_source_url, delimiter=";", storage_options={
    "key": "AKIAIOSFOLKFSSAMPLES",
    "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "client_kwargs": {"endpoint_url": "http://localhost:8000"}
})

# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="famous_people"
)

# View some of the recorded Dataset information
print(f"Dataset name: {dataset.name}")
print(f"Dataset source URI: {dataset.source.uri}")

# Use mlflow input logging to track the dataset versioned by lakeFS
with mlflow.start_run() as run:
    mlflow.log_input(dataset, context="training")
    mlflow.set_tag("lakefs_repo", repo_id)
    mlflow.set_tag("lakefs_branch", branch_id)  # Log the branch id as a friendly lakeFS reference for locating the input dataset

# Inspect run's dataset
logged_run = mlflow.get_run(run.info.run_id)

# Retrieve the Dataset object
logged_dataset = logged_run.inputs.dataset_inputs[0].dataset

# View some of the recorded Dataset information
print(f"Logged dataset name: {logged_dataset.name}")
print(f"Logged dataset source URI: {logged_dataset.source}")
```

Output
```text
Dataset name: famous_people
Dataset source URI: s3://my-repo/3afddad4fef987b4919f5e82f16682c018f59ed2ff003a6a81adf72edaad23c3/famous_people.csv
Logged dataset name: famous_people
Logged dataset source URI: {"uri": "s3://my-repo/3afddad4fef987b4919f5e82f16682c018f59ed2ff003a6a81adf72edaad23c3/famous_people.csv"}
```

### Example: Using Spark

The example below configures Spark to access lakeFS' [S3-compatible API](spark.md#s3-compatible-api) and loads a Delta
Lake table into the experiment.

```python
import lakefs
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakeFS / MLflow") \
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8000") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.access.key", "AKIAlakefs12345EXAMPLE") \
    .config("spark.hadoop.fs.s3a.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

repo = lakefs.Repository("my-repo")
repo_id = repo.id

exp_branch = repo.branch("experiment-1").create(source_reference="main", exist_ok=True)
branch_id = exp_branch.id
head_commit_id = exp_branch.head.id

table_path = "gold/train_v2/"

dataset_source_url = f"s3://{repo_id}/{head_commit_id}/{table_path}"

# Load the Delta Lake table from lakeFS at the branch's head commit, so the exact version is pinned
dataset = mlflow.data.load_delta(path=dataset_source_url, name="boat-images")

# View some of the recorded Dataset information
print(f"Dataset name: {dataset.name}")
print(f"Dataset source URI: {dataset.source.path}")

# Use mlflow input logging to track the dataset versioned by lakeFS
with mlflow.start_run() as run:
    mlflow.log_input(dataset, context="training")
    mlflow.set_tag("lakefs_repo", repo_id)
    mlflow.set_tag("lakefs_branch", branch_id)  # Log the branch id as a friendly lakeFS reference for locating the input dataset

# Inspect run's dataset
logged_run = mlflow.get_run(run.info.run_id)

# Retrieve the Dataset object
logged_dataset = logged_run.inputs.dataset_inputs[0].dataset

# View some of the recorded Dataset information
print(f"Logged dataset name: {logged_dataset.name}")
print(f"Logged dataset source URI: {logged_dataset.source}")
```

Output
```text
Dataset name: boat-images
Dataset source URI: s3://my-repo/3afddad4fef987b4919f5e82f16682c018f59ed2ff003a6a81adf72edaad23c3/gold/train_v2/
Logged dataset name: boat-images
Logged dataset source URI: {"path": "s3://my-repo/3afddad4fef987b4919f5e82f16682c018f59ed2ff003a6a81adf72edaad23c3/gold/train_v2/"}
```

### Reproduce experiment results

To reproduce the results of a specific experiment run in MLflow, it's essential to retrieve the exact dataset and associated
metadata used during that run. While the MLflow Tracking UI provides a general overview, detailed dataset information
and its source are best accessed programmatically.

1. **Obtain the Run ID**: Navigate to the MLflow UI and copy the Run ID of the experiment run you're interested in.

![mlflow run](../assets/img/mlflow_inspect_experiment_run.png)

2. **Extract dataset information using MLflow's Python SDK**:

```python
import mlflow

# Inspect run's dataset and tags
run_id = "c0f8fbb1b63748abaa0a6479115e272c"
run = mlflow.get_run(run_id)

# Retrieve the Dataset object
logged_dataset = run.inputs.dataset_inputs[0].dataset

# View some of the recorded Dataset information
print(f"Run ID: {run_id} Dataset name: {logged_dataset.name}")
print(f"Run ID: {run_id} Dataset source URI: {logged_dataset.source}")

# Retrieve run's tags
logged_tags = run.data.tags
print(f"Run ID: {run_id} tags: {logged_tags}")
```

Output
```text
Run ID: c0f8fbb1b63748abaa0a6479115e272c Dataset name: boat-images
Run ID: c0f8fbb1b63748abaa0a6479115e272c Dataset source URI: {"path": "s3://my-repo/3afddad4fef987b4919f5e82f16682c018f59ed2ff003a6a81adf72edaad23c3/gold/train_v2/"}
Run ID: c0f8fbb1b63748abaa0a6479115e272c tags: {'lakefs_branch': 'experiment-1', 'lakefs_repo': 'my-repo'}
```

Notes:
* The dataset source URI provides the location of the dataset at the exact version used in the run.
* Run tags, such as `lakefs_branch` and `lakefs_repo`, offer additional context about the dataset's origin within lakeFS.
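
With the pinned source URI in hand, you can re-load the exact dataset version to reproduce the run. Below is a minimal
sketch, assuming the Delta table source from the Spark example above (note that `logged_dataset.source` is a JSON
string, as the output shows) and an active Spark session configured for lakeFS:

```python
import json
import mlflow

# The logged source is serialized JSON; extract the pinned lakeFS path
pinned_path = json.loads(logged_dataset.source)["path"]

# Re-load the exact dataset version that the original run trained on
dataset = mlflow.data.load_delta(path=pinned_path, name=logged_dataset.name)
```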

### Compare runs input

To determine whether two distinct MLflow runs utilized the same input dataset, you can compare specific attributes of
their logged Dataset objects. The source attribute, which contains the versioned dataset's URI, is a common choice for
this comparison. Here's an example:

```python
import mlflow

first_run_id = "4c0464d665944dc5bb90587d455948b8"
first_run = mlflow.get_run(first_run_id)

# Retrieve the Dataset object
first_dataset = first_run.inputs.dataset_inputs[0].dataset
first_dataset_src = first_dataset.source

sec_run_id = "12b91e073a8b40df97ea8d570534de31"
sec_run = mlflow.get_run(sec_run_id)

# Retrieve the Dataset object
sec_dataset = sec_run.inputs.dataset_inputs[0].dataset
sec_dataset_src = sec_dataset.source

if first_dataset_src == sec_dataset_src:
    print("The runs used the same dataset version")
else:
    print("The runs used different dataset versions")

print(f"First dataset src: {first_dataset_src}")
print(f"Second dataset src: {sec_dataset_src}")
```

Output
```text
The runs used different dataset versions
First dataset src: {"uri": "s3://mlflow-tracking/f16682c0186a81adf72edaad23c3f59ed2ff3afddad4fef987b4919f5e82003a/gold/train_v2/"}
Second dataset src: {"uri": "s3://mlflow-tracking/3afddad4fef987b4919f5e82f16682c018f59ed2ff003a6a81adf72edaad23c3/gold/train_v2/"}
```

In this example, the source attribute of each Dataset object is compared to determine if the input datasets are identical.
If they differ, you can further inspect them. With the dataset source URI in hand, you can use lakeFS to gain more insights
about the changes made to your dataset:
* Inspect the lakeFS Commit ID: By examining the commit ID within the URI, you can retrieve detailed information about
the commit, including the author and the purpose of the changes.
* Use lakeFS Diff: lakeFS offers a diff function that allows you to compare different versions of your data.

By leveraging these tools, you can effectively track and understand the evolution of your datasets across different MLflow runs.
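
As an illustration, here is a minimal sketch of both steps; the source URI is illustrative, and the `ref`,
`get_commit`, and `diff` helpers are assumed from the high-level lakeFS Python SDK:

```python
import lakefs
from urllib.parse import urlparse

# Parse the repository and commit id out of a logged dataset source URI (illustrative)
uri = "s3://my-repo/3afddad4fef987b4919f5e82f16682c018f59ed2ff003a6a81adf72edaad23c3/gold/train_v2/"
parsed = urlparse(uri)
repo_id = parsed.netloc
commit_id = parsed.path.lstrip("/").split("/")[0]

repo = lakefs.Repository(repo_id)

# Inspect the commit: who made it, and why
commit = repo.ref(commit_id).get_commit()
print(f"Author: {commit.committer}, message: {commit.message}")

# Diff the commit against the current main branch to see how the dataset evolved
for change in repo.branch("main").diff(other_ref=commit_id):
    print(change.type, change.path)
```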
