auto dataset support
fostiropoulos committed Jun 10, 2023
1 parent 8c495ec commit 9e513d9
Showing 6 changed files with 102 additions and 53 deletions.
16 changes: 9 additions & 7 deletions README.md
@@ -4,7 +4,7 @@ The code was used for the experiments and results of
**Batch-Model-Consolidation** [[arXiv]](https://openaccess.thecvf.com/content/CVPR2023/papers/Fostiropoulos_Batch_Model_Consolidation_A_Multi-Task_Model_Consolidation_Framework_CVPR_2023_paper.pdf) [[Website]](https://fostiropoulos.github.io/stream_benchmark/).
If using this code please cite:

-```
+```bibtex
@inproceedings{fostiropoulos2023batch,
title={Batch Model Consolidation: A Multi-Task Model Consolidation Framework},
author={Fostiropoulos, Iordanis and Zhu, Jiaye and Itti, Laurent},
@@ -13,22 +13,22 @@ If using this code please cite:
year={2023}
}
```
-This repository is a benchmark of methods found in [FACIL](https://github.com/mmasana/FACIL) and [Mammoth](https://github.com/aimagelab/mammoth) combined and adapted to work with the [Stream](https://github.com/fostiropoulos/stream) dataset.
+This repository is a benchmark of methods found in [FACIL](https://github.com/mmasana/FACIL) and [Mammoth](https://github.com/aimagelab/mammoth) combined and adapted to work with the [AutoDS](https://github.com/fostiropoulos/auto-dataset) dataset to evaluate methods on a long sequence of tasks.



## Install

-1. Install the [Stream dataset](https://github.com/fostiropoulos/stream).
+1. Install the [AutoDS dataset](https://github.com/fostiropoulos/auto-dataset).
2. `git clone https://github.com/fostiropoulos/stream_benchmark.git`
3. `cd stream_benchmark`
4. `pip install . stream_benchmark`


-## Stream Feature Vectors [Download](https://drive.google.com/file/d/1insLK3FoGw-UEQUNnhzyxsql7z28lplZ/view)
+## AutoDS Feature Vectors [Download](https://drive.google.com/file/d/1insLK3FoGw-UEQUNnhzyxsql7z28lplZ/view)

We use 71 datasets with extracted features from pre-trained models,
-supported in the Stream dataset. [The detailed table](https://github.com/fostiropoulos/stream/blob/cvpr_release/assets/DATASET_TABLE.md).
+supported in the AutoDS dataset. [The detailed table](https://github.com/fostiropoulos/auto-dataset/blob/cvpr_release/assets/DATASET_TABLE.md).

## Hyperparameters

@@ -57,12 +57,14 @@ For `model_name` support see below.
`{num_gpus}` is the fractional number of GPUs to use.
Set this so that `{GPU usage per experiment} * {num_gpus} < 1`.

+## Extending
+
+The code in [test_benchmark.py](tests/test_benchmark.py) is a good starting point, as a simple example (ignoring the mock patching), for understanding how the benchmark can be extended.


## Methods implemented
-| Description | `model_name` | File |
-|:-----------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------|:-----------------------------------------------------|
+| Description                                                       | `model_name`                                                                                              | File                                                  |
+| :--------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------ | :--------------------------------------------------- |
| Continual learning via Gradient Episodic Memory. | [gem](https://arxiv.org/abs/1706.08840) | [gem.py](stream_benchmark/models/gem.py) |
| Continual learning via online EWC. | [ewc_on](https://arxiv.org/pdf/1805.06370.pdf) | [ewc_on.py](stream_benchmark/models/ewc_on.py) |
| Continual learning via MAS. | [mas](https://arxiv.org/abs/1711.09601) | [mas.py](stream_benchmark/models/mas.py) |
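
For reference, the rewritten test in this commit (tests/test_benchmark.py, shown later in this diff) drives the benchmark programmatically through `train_method`, passing the path to a JSON file of hyperparameters. A minimal sketch of that entry point (the working directory is hypothetical; the `hparams` dict mirrors the one defined in the test):

```python
import json
from pathlib import Path

from stream_benchmark.__main__ import train_method

# Hypothetical working directory; the AutoDS feature vectors are assumed
# to have been downloaded and extracted under the same root.
work_dir = Path("/tmp/stream_benchmark_run")
work_dir.mkdir(parents=True, exist_ok=True)

# Hyperparameters mirroring tests/test_benchmark.py in this commit.
hparams_path = work_dir / "hparams.json"
hparams_path.write_text(json.dumps({
    "early_stopping_patience": 10,
    "batch_size": 64,
    "buffer_size": 10000,
    "lr": 0.1,
    "minibatch_size": 64,
    "n_epochs": 20,
    "scheduler_threshold": 1e-4,
    "scheduler_patience": 10,
    "device": "cuda",
    "sgd": {},
}))

train_method(
    save_path=work_dir,
    model_name="sgd",
    dataset_path=work_dir,
    hparams=hparams_path,
)
```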
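
The new "Extending" note above points at tests/test_benchmark.py. The pattern used there is to substitute `SequentialStream.make_ds` with a custom dataset factory via `mock.patch`. A rough sketch of the same pattern (`my_datasets` is a hypothetical list; the names must be datasets AutoDS knows about):

```python
from unittest import mock

from autods.main import AutoDS

# Hypothetical: restrict the task stream to datasets of your choosing.
my_datasets = ["cifar100"]


def make_ds(self, task_id, train):
    # Mirrors the factory patched in by tests/test_benchmark.py.
    transform = None
    if self.feats_name is None:
        transform = self.transforms(train)
    return AutoDS(
        self.root_path,
        task_id=task_id,
        feats_name=self.feats_name,
        train=train,
        transform=transform,
        datasets=my_datasets,
    )


with mock.patch(
    "stream_benchmark.datasets.seq_stream.SequentialStream.make_ds", make_ds
):
    pass  # call train_method(...) here, as in the previous sketch
```
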
10 changes: 5 additions & 5 deletions docs/README.md
@@ -9,7 +9,7 @@
&nbsp;&nbsp;&nbsp;
<a href="https://github.com/fostiropoulos/stream_benchmark">[Code]</a>
&nbsp;&nbsp;&nbsp;
-<a href="https://github.com/fostiropoulos/stream">[Dataset]</a>
+<a href="https://github.com/fostiropoulos/auto-dataset">[Dataset]</a>
</p>

## Abstract
@@ -53,13 +53,13 @@ The parallelism of this framework enables BMC to learn long task sequences efficiently.

![Paralleled multi-expert training framework](https://drive.google.com/uc?export=view&id=1NAswFVQtiNn6xkilUig42guGfvi-babV)

-## The Stream Dataset
+## Auto-Dataset

-Stream dataset implements the logic for processing and managing a large sequence of datasets,
+AutoDS implements the logic for processing and managing a large sequence of datasets,
and provides a method to train on interdisciplinary tasks by projecting all datasets on the same dimension,
by extracting features from pre-trained models.

-See [the repository](https://github.com/fostiropoulos/stream/tree/cvpr_release) for Stream dataset installation and usages.
+See [the repository](https://github.com/fostiropoulos/auto-dataset/) for dataset installation and usage.

Download the extracted features for Stream datasets [here](https://drive.google.com/file/d/1insLK3FoGw-UEQUNnhzyxsql7z28lplZ/view).

@@ -74,7 +74,7 @@ Our implementation of BMC as well as the baselines can be found [here](https://g

## Citation

-```
+```bibtex
@inproceedings{fostiropoulos2023batch,
title={Batch Model Consolidation: A Multi-Task Model Consolidation Framework},
author={Fostiropoulos, Iordanis and Zhu, Jiaye and Itti, Laurent},
6 changes: 3 additions & 3 deletions setup.py
@@ -7,8 +7,8 @@
version="1.0",
description="Stream Benchmark",
author="Iordanis Fostiropoulos",
author_email="dev@iordanis.xyz",
url="https://iordanis.xyz/",
author_email="mail@iordanis.me",
url="https://iordanis.me/",
    python_requires=">3.10",
    long_description=open("README.md").read(),
    packages=find_packages(),
@@ -22,7 +22,7 @@
"quadprog==0.1.11",
"pandas==2.0.0",
"tabulate==0.9.0",
"stream @ git+https://github.com/fostiropoulos/stream.git",
"autods==1.0",
],
extras_require={
"dev": [
19 changes: 13 additions & 6 deletions stream_benchmark/datasets/aux_cifar100.py
@@ -1,7 +1,7 @@
from pathlib import Path

-from stream.main import Stream
from torch.utils.data import DataLoader, Dataset
+from autods.main import AutoDS


class AuxDataset(Dataset):
@@ -40,14 +40,21 @@ def make_ds(self):
        root_path = Path(self.dataset_path)

        # use the full dataset as aux data, no need to split
-        train_ds = Stream(
-            root_path=root_path, datasets=["cifar100"], task_id = 0, feats_name="default", train=True
+        train_ds = AutoDS(
+            root_path=root_path,
+            datasets=["cifar100"],
+            task_id=0,
+            feats_name="default",
+            train=True,
        )
-        test_ds = Stream(
-            root_path=root_path, datasets=["cifar100"], task_id = 0, feats_name="default", train=False
+        test_ds = AutoDS(
+            root_path=root_path,
+            datasets=["cifar100"],
+            task_id=0,
+            feats_name="default",
+            train=False,
        )
        return train_ds, test_ds


    def get_data_loaders(self):
        return self.train_loader, self.test_loader
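
Read on their own, the constructor calls above give a minimal recipe for loading a single AutoDS task as pre-extracted feature vectors; a sketch, with a hypothetical root path:

```python
from pathlib import Path

from autods.main import AutoDS
from torch.utils.data import DataLoader

root_path = Path("/data/autods")  # hypothetical location of the feature vectors

# feats_name="default" loads pre-extracted features rather than raw images.
train_ds = AutoDS(
    root_path=root_path,
    datasets=["cifar100"],
    task_id=0,
    feats_name="default",
    train=True,
)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
```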
6 changes: 3 additions & 3 deletions stream_benchmark/datasets/seq_stream.py
@@ -1,5 +1,5 @@
import torch.nn.functional as F
-from stream.main import Stream
+from autods.main import AutoDS
from torch.utils.data import DataLoader
from torch.utils.data.dataset import ConcatDataset
from torchvision import transforms
@@ -37,7 +37,7 @@ def __init__(
self.feats_name = "default"
self.image_size = 224
self.val_image_size = 224
mock_ds: Stream = self.make_ds(task_id, True)
mock_ds: AutoDS = self.make_ds(task_id, True)
if isinstance(mock_ds.dataset, ConcatDataset):
self.dataset_len = [len(ds) for ds in mock_ds.dataset.datasets]

@@ -75,7 +75,7 @@ def make_ds(self, task_id, train):
        if self.feats_name is None:
            transform = self.transforms(train)

-        s = Stream(
+        s = AutoDS(
            self.root_path,
            task_id=task_id,
            feats_name=self.feats_name,
98 changes: 69 additions & 29 deletions tests/test_benchmark.py
@@ -1,5 +1,6 @@
import copy
import io
+import json
import logging
import shutil
import tempfile
@@ -9,18 +10,33 @@

import numpy as np
import torch
+from autods.dataset import Dataset
+from autods.main import AutoDS
+from autods.utils import extract
from PIL import Image
-from stream.dataset import Dataset
-from stream.main import Stream
-from stream.utils import extract

from stream_benchmark.__main__ import train_method
-from stream_benchmark.datasets.seq_stream import include_ds

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+MAKE_FEATS_BATCH_SIZE = 500
+include_ds = ["mock1", "mock2", "mock3"]
+hparams = {
+    "early_stopping_patience": 10,
+    "batch_size": 64,
+    "buffer_size": 10000,
+    "lr": 0.1,
+    "minibatch_size": 64,
+    "n_epochs": 20,
+    "scheduler_threshold": 1e-4,
+    "scheduler_patience": 10,
+    "device": "cuda",
+    "sgd": {},
+}


class MockDataset(Dataset):
metadata_url = "https://iordanis.xyz/"
metadata_url = "https://iordanis.me/"
remote_urls = {"mock.tar": None}
name = "mock"
file_hash_map = {"mock.tar": "blahblah"}
@@ -60,9 +76,7 @@ def __init__(
kwargs["action"] = "process"
super().__init__(*args, **kwargs)
if mock_download:
self.make_features(500, "cuda","clip")

# ds.make_features(1024, DEVICE, clean=True, feature_extractor="clip")
self.make_features(MAKE_FEATS_BATCH_SIZE, DEVICE, "clip")

    def _process(self, raw_data_dir: Path):
        archive_path = raw_data_dir.joinpath("mock.tar")
@@ -87,35 +101,61 @@ def _make_metadata(self, raw_data_dir: Path):
        torch.save(metadata, self.metadata_path)


-class MockDataset2(MockDataset):
-    name = "mock2"
-    pass
+datasets = []
+for ds in include_ds:
+
+    class _MockClass(MockDataset):
+        pass
+
-class MockDataset3(MockDataset):
-    name = "mock3"
-    pass
+    _MockClass.__name__ = ds.upper()
+    _MockClass.name = ds
+    datasets.append(_MockClass)
+
-class MockDataset4(MockDataset):
-    name = "mock4"
-    pass
+sizes = (np.arange(len(datasets)) + 1) * 100


+def make_ds(self, task_id, train):

-def test_benchmark(tmp_path: Path):
-    datasets = [MockDataset, MockDataset2, MockDataset3, MockDataset4]
-    sizes = (np.arange(len(datasets)) + 1) * 100
-    with mock.patch(
-        "stream.main.Stream.supported_datasets",
+        "autods.main.AutoDS.supported_datasets",
-        return_value=datasets,
-    ):
-        for ds, size in zip(datasets, sizes):
-            ds(tmp_path, size=size, mock_download=True)
-
-    with mock.patch(
-        "stream.dataset.Dataset.assert_downloaded", return_value=True
-    ), mock.patch("stream.dataset.Dataset.verify_downloaded", return_value=True):
-        train_method(tmp_path, "sgd", tmp_path, "clip")
-    breakpoint()
-    return

+    transform = None
+    if self.feats_name is None:
+        transform = self.transforms(train)
+
+    s = AutoDS(
+        self.root_path,
+        task_id=task_id,
+        feats_name=self.feats_name,
+        train=train,
+        transform=transform,
+        datasets=include_ds,
+    )
+    return s


+def test_benchmark(tmp_path: Path):
+
+    hpp = tmp_path.joinpath("hparams.json")
+    hpp.write_text(json.dumps(hparams))
+
+    for ds, size in zip(datasets, sizes):
+        ds(tmp_path, size=size, mock_download=True)
+
+    with mock.patch(
+        "stream_benchmark.datasets.seq_stream.SequentialStream.make_ds", make_ds
+    ), mock.patch(
+        "autods.dataset.Dataset.assert_downloaded", return_value=True
+    ), mock.patch(
+        "autods.dataset.Dataset.verify_downloaded", return_value=True
+    ):
+        train_method(
+            save_path=tmp_path, model_name="sgd", dataset_path=tmp_path, hparams=hpp
+        )
+    breakpoint()
+    return


if __name__ == "__main__":
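
The rewritten test ends with a `breakpoint()`, which drops into the debugger once `train_method` returns. It can presumably be run under pytest as usual; a sketch, assuming pytest is installed and a CUDA device is available (the test's hyperparameters pin `device` to `"cuda"`):

```python
import pytest

# Equivalent to running `pytest tests/test_benchmark.py` from the repository root.
pytest.main(["tests/test_benchmark.py"])
```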
