Commit 7ebafbe

doc: migrate to mkdocs
1 parent 87d1ae2 commit 7ebafbe

25 files changed: +1683 -612 lines changed

.cursorrules

+8
@@ -0,0 +1,8 @@
+- Use pydantic 2
+- Pytest
+- Use black formatting
+- Avoid methods with side effects; if they are needed, add a "_" suffix
+- Prefer pathlib over os
+- Prefer getter method names like `tasks` over `get_tasks`
+- Commands need to be run using `poetry run <command>`
+- Use simple tests with a bit of logging that we can run with `poetry run pytest -s` to check that the code works as expected
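
The naming rules above (pathlib over os, getter names without `get_`, a trailing underscore for methods with side effects) can be sketched as follows. `TaskList` and its fields are hypothetical and exist only for illustration; a plain dataclass stands in for a pydantic model so the sketch needs nothing beyond the standard library.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class TaskList:
    """Hypothetical class, used only to illustrate the naming rules."""

    directory: Path
    names: list[str] = field(default_factory=list)

    def tasks(self) -> list[str]:
        # Getter named `tasks`, not `get_tasks`; returns a new sorted list
        # and leaves the instance untouched.
        return sorted(self.names)

    def paths(self) -> list[Path]:
        # pathlib's `/` operator instead of os.path.join.
        return [self.directory / f"{name}.txt" for name in self.tasks()]

    def add_(self, name: str) -> None:
        # Mutates the instance, so the name carries a trailing underscore.
        self.names.append(name)


task_list = TaskList(directory=Path("todo"))
task_list.add_("write docs")
task_list.add_("build site")
print(task_list.tasks())
print(task_list.paths())
```

Under the last rule, a script like this would be run with `poetry run python <script>`.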

.github/workflows/gh-pages.yml

+36
@@ -0,0 +1,36 @@
+name: Deploy Documentation
+
+on:
+  push:
+    branches:
+      - master
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install poetry
+          poetry install
+
+      - name: Build documentation
+        run: poetry run mkdocs build
+
+      - name: Deploy to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v3
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: ./site

.gitignore

+1
@@ -7,6 +7,7 @@ dist
 .eggs/
 build/
 *.pyc
+site/
 
 AUTHORS
 ChangeLog

.readthedocs.yml

-30
This file was deleted.

README.md

+124
@@ -0,0 +1,124 @@
+# Pytorch Datastream
+
+[![PyPI version](https://badge.fury.io/py/pytorch-datastream.svg)](https://badge.fury.io/py/pytorch-datastream)
+[![Python versions](https://img.shields.io/pypi/pyversions/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
+[![Documentation Status](https://readthedocs.org/projects/pytorch-datastream/badge/?version=latest)](https://pytorch-datastream.readthedocs.io/en/latest/?badge=latest)
+[![License](https://img.shields.io/pypi/l/pytorch-datastream.svg)](https://pypi.python.org/pypi/pytorch-datastream)
+
+This is a simple library for creating readable dataset pipelines and reusing best practices for issues such as imbalanced datasets. There are just two components to keep track of: `Dataset` and `Datastream`.
+
+`Dataset` is a simple mapping between an index and an example. It provides pipelining of functions in a readable syntax originally adapted from TensorFlow 2's `tf.data.Dataset`.
+
+`Datastream` combines a `Dataset` and a sampler into a stream of examples. It provides a simple solution to oversampling / stratification, weighted sampling, and finally converting to a `torch.utils.data.DataLoader`.
+
+## Install
+
+```bash
+poetry add pytorch-datastream
+```
+
+Or, for the old-timers:
+
+```bash
+pip install pytorch-datastream
+```
+
+## Usage
+
+The list below showcases functions that are useful in most standard and non-standard cases; it is not exhaustive. See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for a more extensive description of the API and its usage.
+
+```python
+Dataset.from_subscriptable
+Dataset.from_dataframe
+Dataset
+    .map
+    .subset
+    .split
+    .cache
+    .with_columns
+
+Datastream.merge
+Datastream.zip
+Datastream
+    .map
+    .data_loader
+    .zip_index
+    .update_weights_
+    .update_example_weight_
+    .weight
+    .state_dict
+    .load_state_dict
+```
+
+### Simple image dataset example
+
+Here's a basic example of loading images from a directory:
+
+```python
+from datastream import Dataset
+from pathlib import Path
+from PIL import Image
+
+# Assuming images are in a directory structure like:
+# images/
+#   class1/
+#     image1.jpg
+#     image2.jpg
+#   class2/
+#     image3.jpg
+#     image4.jpg
+
+image_dir = Path("images")
+image_paths = list(image_dir.glob("*/*.jpg"))
+
+dataset = (
+    Dataset.from_paths(
+        image_paths,
+        pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+)\.jpg"
+    )
+    .map(lambda row: dict(
+        image=Image.open(row["path"]),
+        class_name=row["class_name"],
+        image_name=row["image_name"],
+    ))
+)
+
+# Access an item from the dataset
+first_item = dataset[0]
+print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")
+```
+
+### Merge / stratify / oversample datastreams
+
+Each fruit datastream below repeatedly yields the name of its fruit.
+
+```python
+>>> datastream = Datastream.merge([
+...     (apple_datastream, 2),
+...     (pear_datastream, 1),
+...     (banana_datastream, 1),
+... ])
+>>> next(iter(datastream.data_loader(batch_size=8)))
+['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
+```
+
+### Zip independently sampled datastreams
+
+Each fruit datastream below repeatedly yields the name of its fruit.
+
+```python
+>>> datastream = Datastream.zip([
+...     apple_datastream,
+...     Datastream.merge([pear_datastream, banana_datastream]),
+... ])
+>>> next(iter(datastream.data_loader(batch_size=4)))
+[('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
+```
+
+### More usage examples
+
+See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for more usage examples.
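
The 2:1:1 interleaving shown in the README's merge example can be imitated in plain Python. This is only an illustration of the resulting order, not how `Datastream.merge` is implemented: `merge_weighted` is a hypothetical helper, and the endless `repeat(...)` iterators stand in for the fruit datastreams, which the README does not define.

```python
from itertools import islice, repeat


def merge_weighted(streams_and_counts):
    """Yield `count` items from each stream in turn, forever.

    Assumes every stream is endless; this is a sketch, not the library's code.
    """
    iterators = [(iter(stream), count) for stream, count in streams_and_counts]
    while True:
        for iterator, count in iterators:
            yield from islice(iterator, count)


merged = merge_weighted([
    (repeat("apple"), 2),
    (repeat("pear"), 1),
    (repeat("banana"), 1),
])

# Take one "batch" of eight examples from the merged stream.
batch = list(islice(merged, 8))
print(batch)
```

The printed batch follows the same apple, apple, pear, banana pattern as the README output above.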

conftest.py

+6
@@ -0,0 +1,6 @@
+def pytest_configure(config):
+    """Configure pytest."""
+    config.addinivalue_line(
+        "markers",
+        "codeblocks: mark test to be collected from code blocks",
+    )
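
The hook above can be exercised without starting pytest. The sketch below copies the hook and calls it with a stand-in object; `FakeConfig` is hypothetical and implements only the `addinivalue_line(name, line)` method that the hook uses.

```python
class FakeConfig:
    """Hypothetical stand-in for pytest's config object; records ini lines."""

    def __init__(self):
        self.lines = []

    def addinivalue_line(self, name, line):
        # Same (name, line) call shape as pytest's Config.addinivalue_line.
        self.lines.append((name, line))


def pytest_configure(config):
    """Configure pytest (copied from conftest.py above)."""
    config.addinivalue_line(
        "markers",
        "codeblocks: mark test to be collected from code blocks",
    )


config = FakeConfig()
pytest_configure(config)
print(config.lines)
```

With the marker registered this way, tests decorated with `@pytest.mark.codeblocks` should not trigger pytest's unknown-marker warning.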
