|
| 1 | +# Pytorch Datastream |
| 2 | + |
| 3 | +[](https://badge.fury.io/py/pytorch-datastream) |
| 4 | +[](https://pypi.python.org/pypi/pytorch-datastream) |
| 5 | +[](https://pytorch-datastream.readthedocs.io/en/latest/?badge=latest) |
| 6 | +[](https://pypi.python.org/pypi/pytorch-datastream) |
| 7 | + |
| 8 | +This is a simple library for creating readable dataset pipelines and reusing best practices for issues such as imbalanced datasets. There are just two components to keep track of: `Dataset` and `Datastream`. |
| 9 | + |
| 10 | +`Dataset` is a simple mapping between an index and an example. It provides pipelining of functions in a readable syntax originally adapted from TensorFlow 2's `tf.data.Dataset`. |
| 11 | + |
| 12 | +`Datastream` combines a `Dataset` and a sampler into a stream of examples. It provides a simple solution to oversampling / stratification, weighted sampling, and finally converting to a `torch.utils.data.DataLoader`. |
| 13 | + |
| 14 | +## Install |
| 15 | + |
| 16 | +```bash |
| 17 | +poetry add pytorch-datastream |
| 18 | +``` |
| 19 | + |
| 20 | +Or, for the old-timers: |
| 21 | + |
| 22 | +```bash |
| 23 | +pip install pytorch-datastream |
| 24 | +``` |
| 25 | + |
| 26 | +## Usage |
| 27 | + |
| 28 | +The list below showcases functions that are useful in most standard and non-standard cases. It is not exhaustive. See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for a more complete overview of the API and its usage. |
| 29 | + |
| 30 | +```python |
| 31 | +Dataset.from_subscriptable |
| 32 | +Dataset.from_dataframe |
| 33 | +Dataset |
| 34 | +    .map |
| 35 | +    .subset |
| 36 | +    .split |
| 37 | +    .cache |
| 38 | +    .with_columns |
| 39 | + |
| 40 | +Datastream.merge |
| 41 | +Datastream.zip |
| 42 | +Datastream |
| 43 | +    .map |
| 44 | +    .data_loader |
| 45 | +    .zip_index |
| 46 | +    .update_weights_ |
| 47 | +    .update_example_weight_ |
| 48 | +    .weight |
| 49 | +    .state_dict |
| 50 | +    .load_state_dict |
| 51 | +``` |
| 52 | + |
| 53 | +### Simple image dataset example |
| 54 | + |
| 55 | +Here's a basic example of loading images from a directory: |
| 56 | + |
| 57 | +```python |
| 58 | +from datastream import Dataset |
| 59 | +from pathlib import Path |
| 60 | +from PIL import Image |
| 61 | + |
| 62 | +# Assuming images are in a directory structure like: |
| 63 | +# images/ |
| 64 | +#     class1/ |
| 65 | +#         image1.jpg |
| 66 | +#         image2.jpg |
| 67 | +#     class2/ |
| 68 | +#         image3.jpg |
| 69 | +#         image4.jpg |
| 70 | + |
| 71 | +image_dir = Path("images") |
| 72 | +image_paths = list(image_dir.glob("**/*.jpg")) |
| 73 | + |
| 74 | +dataset = ( |
| 75 | +    Dataset.from_paths( |
| 76 | +        image_paths, |
| 77 | +        pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+).jpg" |
| 78 | +    ) |
| 79 | +    .map(lambda row: dict( |
| 80 | +        image=Image.open(row["path"]), |
| 81 | +        class_name=row["class_name"], |
| 82 | +        image_name=row["image_name"], |
| 83 | +    )) |
| 84 | +) |
| 85 | + |
| 86 | +# Access an item from the dataset |
| 87 | + |
| 88 | +first_item = dataset[0] |
| 89 | +print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}") |
| 90 | +``` |
| 91 | + |
| 92 | +### Merge / stratify / oversample datastreams |
| 93 | + |
| 94 | +Each of the fruit datastreams below repeatedly yields the string of its fruit type. |
| 95 | + |
| 96 | +```python |
| 97 | + |
| 98 | +>>> datastream = Datastream.merge([ |
| 99 | +...     (apple_datastream, 2), |
| 100 | +...     (pear_datastream, 1), |
| 101 | +...     (banana_datastream, 1), |
| 102 | +... ]) |
| 103 | +>>> next(iter(datastream.data_loader(batch_size=8))) |
| 104 | +['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana'] |
| 105 | +``` |
| 106 | + |
| 107 | +### Zip independently sampled datastreams |
| 108 | + |
| 109 | +Each of the fruit datastreams below repeatedly yields the string of its fruit type. |
| 110 | + |
| 111 | +```python |
| 112 | + |
| 113 | +>>> datastream = Datastream.zip([ |
| 114 | +...     apple_datastream, |
| 115 | +...     Datastream.merge([pear_datastream, banana_datastream]), |
| 116 | +... ]) |
| 117 | +>>> next(iter(datastream.data_loader(batch_size=4))) |
| 118 | +[('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')] |
| 119 | +``` |
| 120 | + |
| 121 | +### More usage examples |
| 122 | + |
| 123 | +See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for more usage examples. |
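
As a library-free mental model of the weighted `Datastream.merge` example in the README above, the sketch below interleaves three endless streams by taking a fixed number of items from each in turn. The helper `merge_round_robin` is hypothetical and written only for illustration. It is not part of pytorch-datastream, and the library's own sampling logic may differ.

```python
from itertools import cycle, islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def merge_round_robin(
    streams_and_counts: List[Tuple[Iterable[T], int]],
) -> Iterator[T]:
    """Hypothetical helper: yield `count` items from each stream in turn, forever.

    Assumes every stream is endless, as the fruit datastreams in the README are.
    """
    iterators = [(iter(stream), count) for stream, count in streams_and_counts]
    while True:
        for iterator, count in iterators:
            for _ in range(count):
                yield next(iterator)


# Stand-ins for the README's fruit datastreams: each repeats one string forever.
apple_stream = cycle(["apple"])
pear_stream = cycle(["pear"])
banana_stream = cycle(["banana"])

merged = merge_round_robin([
    (apple_stream, 2),
    (pear_stream, 1),
    (banana_stream, 1),
])

# Take one "batch" of eight examples from the merged stream.
batch = list(islice(merged, 8))
print(batch)
```

With weights 2, 1 and 1, half of every batch comes from the apple stream. This is the oversampling effect the merge example demonstrates.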