nextml-code
diff --git a/.cursorrules b/.cursorrules
Lines changed: 8 additions & 0 deletions
diff --git a/.github/workflows/gh-pages.yml b/.github/workflows/gh-pages.yml
Lines changed: 36 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/.readthedocs.yml b/.readthedocs.yml
Lines changed: 0 additions & 30 deletions
diff --git a/README.md b/README.md
Lines changed: 124 additions & 0 deletions
diff --git a/conftest.py b/conftest.py
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,8 @@
+- Use pydantic 2
+- Pytest
+- Use black formatting
+- Avoid methods with side effects; if they are needed, add a "_" suffix
+- Prefer pathlib over os
+- Prefer getter method names like `tasks` over `get_tasks`
+- Commands need to be run using `poetry run <command>`
+- Use simple tests with a bit of logging that we can run with `poetry run pytest -s` to check that the code works as expected
@@ -0,0 +1,36 @@
+name: Deploy Documentation
+
+on:
+  push:
+    branches:
+      - master
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+          
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install poetry
+          poetry install
+          
+      - name: Build documentation
+        run: poetry run mkdocs build
+        
+      - name: Deploy to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v3
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: ./site 
@@ -7,6 +7,7 @@ dist
 .eggs/
 build/
 *.pyc
+site/
 
 AUTHORS
 ChangeLog
@@ -0,0 +1,124 @@
+# Pytorch Datastream
+
+[![PyPI version](https://badge.fury.io/py/pytorch-datastream.svg)](https://badge.fury.io/py/pytorch-datastream)
+[![Python versions](https://img.shields.io/pypi/pyversions/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
+[![Documentation Status](https://readthedocs.org/projects/pytorch-datastream/badge/?version=latest)](https://pytorch-datastream.readthedocs.io/en/latest/?badge=latest)
+[![License](https://img.shields.io/pypi/l/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
+
+This is a simple library for creating readable dataset pipelines and reusing best practices for issues such as imbalanced datasets. There are just two components to keep track of: `Dataset` and `Datastream`.
+
+`Dataset` is a simple mapping between an index and an example. It provides pipelining of functions in a readable syntax originally adapted from TensorFlow 2's `tf.data.Dataset`.
+
+`Datastream` combines a `Dataset` and a sampler into a stream of examples. It provides a simple solution to oversampling / stratification, weighted sampling, and finally converting to a `torch.utils.data.DataLoader`.
+
+## Install
+
+```bash
+poetry add pytorch-datastream
+```
+
+Or, for the old-timers:
+
+```bash
+pip install pytorch-datastream
+```
+
+## Usage
+
+The list below showcases functions that are useful in most standard and non-standard cases. It is not exhaustive. See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for fuller coverage of the API and its usage.
+
+```python
+Dataset.from_subscriptable
+Dataset.from_dataframe
+Dataset
+    .map
+    .subset
+    .split
+    .cache
+    .with_columns
+
+Datastream.merge
+Datastream.zip
+Datastream
+    .map
+    .data_loader
+    .zip_index
+    .update_weights_
+    .update_example_weight_
+    .weight
+    .state_dict
+    .load_state_dict
+```
+
+### Simple image dataset example
+
+Here's a basic example of loading images from a directory:
+
+```python
+from datastream import Dataset
+from pathlib import Path
+from PIL import Image
+
+# Assuming images are in a directory structure like:
+# images/
+#   class1/
+#     image1.jpg
+#     image2.jpg
+#   class2/
+#     image3.jpg
+#     image4.jpg
+
+image_dir = Path("images")
+image_paths = list(image_dir.glob("*/*.jpg"))
+
+dataset = (
+    Dataset.from_paths(
+        image_paths,
+        pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+).jpg"
+    )
+    .map(lambda row: dict(
+        image=Image.open(row["path"]),
+        class_name=row["class_name"],
+        image_name=row["image_name"],
+    ))
+)
+
+# Access an item from the dataset
+
+first_item = dataset[0]
+print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")
+```
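Review note: the `pattern` passed to `Dataset.from_paths` above uses named groups to turn each path into columns. The sketch below shows only that regex step with the standard-library `re` module. It assumes the pattern is applied to the POSIX form of the path with something like `re.search`, which the README does not confirm. `parse_path` is a hypothetical helper, not part of pytorch-datastream, and the sketch escapes the dot before `jpg`, a small deviation from the README's pattern.

```python
import re
from pathlib import PurePosixPath

# Same named groups as the README pattern; the dot before "jpg" is escaped here.
PATTERN = re.compile(r".*/(?P<class_name>\w+)/(?P<image_name>\w+)\.jpg")


def parse_path(path: PurePosixPath) -> dict:
    """Hypothetical helper: extract the named groups from one image path."""
    match = PATTERN.search(path.as_posix())
    if match is None:
        raise ValueError(f"Path does not match pattern: {path}")
    # Assumption: the resulting row keeps the original path next to the groups.
    return dict(path=str(path), **match.groupdict())


row = parse_path(PurePosixPath("images/class1/image1.jpg"))
print(row)
# {'path': 'images/class1/image1.jpg', 'class_name': 'class1', 'image_name': 'image1'}
```

Note that `\w+` does not match hyphens or spaces, so directory or file names containing them would need a wider character class.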
+
+### Merge / stratify / oversample datastreams
+
+The fruit datastreams given below repeatedly yield the string of their fruit type.
+
+```python
+>>> datastream = Datastream.merge([
+...     (apple_datastream, 2),
+...     (pear_datastream, 1),
+...     (banana_datastream, 1),
+... ])
+>>> next(iter(datastream.data_loader(batch_size=8)))
+['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
+```
+
+### Zip independently sampled datastreams
+
+The fruit datastreams given below repeatedly yield the string of their fruit type.
+
+```python
+>>> datastream = Datastream.zip([
+...     apple_datastream,
+...     Datastream.merge([pear_datastream, banana_datastream]),
+... ])
+>>> next(iter(datastream.data_loader(batch_size=4)))
+[('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
+```
+
+### More usage examples
+
+See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for more usage examples.
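Review note: the merge example in the README shows a batch in which the stream with weight 2 contributes two items per round and the others one each. The pure-Python sketch below reproduces that interleaving pattern with plain iterators. It is not the library's implementation: `weighted_merge` is a hypothetical helper, and the fruit streams are stand-ins built with `itertools.cycle`.

```python
from itertools import cycle, islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def weighted_merge(streams: List[Tuple[Iterable[T], int]]) -> Iterator[T]:
    """Hypothetical helper: take `weight` items from each stream in turn."""
    iterators = [(iter(stream), weight) for stream, weight in streams]
    while True:
        for iterator, weight in iterators:
            for _ in range(weight):
                try:
                    yield next(iterator)
                except StopIteration:
                    # Stop merging as soon as any stream is exhausted.
                    return


# Stand-ins for the fruit datastreams: each repeats its fruit name forever.
apple = cycle(["apple"])
pear = cycle(["pear"])
banana = cycle(["banana"])

batch = list(islice(weighted_merge([(apple, 2), (pear, 1), (banana, 1)]), 8))
print(batch)
# ['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
```

The integer weights act as per-round quotas, so over many rounds apples make up half of the examples regardless of how many apple examples the underlying dataset holds.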
@@ -0,0 +1,6 @@
+def pytest_configure(config):
+    """Configure pytest."""
+    config.addinivalue_line(
+        "markers",
+        "codeblocks: mark test to be collected from code blocks",
+    )