# Pytorch Datastream

This is a simple library for creating readable dataset pipelines and reusing best practices for issues such as imbalanced datasets. There are just two components to keep track of: `Dataset` and `Datastream`.

`Dataset` is a simple mapping between an index and an example. It provides pipelining of functions in a readable syntax originally adapted from TensorFlow 2's `tf.data.Dataset`.

`Datastream` combines a `Dataset` and a sampler into a stream of examples. It provides a simple solution to oversampling / stratification, weighted sampling, and conversion to a `torch.utils.data.DataLoader`.

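To make the "mapping between an index and an example" idea concrete, here is a minimal sketch in plain Python. `TinyDataset` and `from_sequence` are hypothetical names used only for illustration; this is not the library's implementation, just one way such a lazily composed pipeline could look.

```python
from typing import Any, Callable, Sequence


class TinyDataset:
    """Hypothetical stand-in: maps an integer index to an example."""

    def __init__(self, get_item: Callable[[int], Any], length: int):
        self._get_item = get_item
        self._length = length

    @classmethod
    def from_sequence(cls, items: Sequence[Any]) -> "TinyDataset":
        return cls(lambda index: items[index], len(items))

    def map(self, function: Callable[[Any], Any]) -> "TinyDataset":
        # Compose lazily: the function only runs when an index is requested.
        return TinyDataset(
            lambda index: function(self._get_item(index)), self._length
        )

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < self._length:
            raise IndexError(index)
        return self._get_item(index)

    def __len__(self) -> int:
        return self._length


dataset = (
    TinyDataset.from_sequence([1, 2, 3])
    .map(lambda x: x * 10)
    .map(lambda x: {"value": x})
)
print(dataset[1])  # {'value': 20}
```

Each `.map` call returns a new object and defers the work until an item is accessed, which is what keeps a chained pipeline readable without processing the whole dataset up front.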
## Install

```bash
poetry add pytorch-datastream
```

Or, for the old-timers:

```bash
pip install pytorch-datastream
```

## Usage

The list below showcases functions that are useful in most standard and non-standard cases. It is not exhaustive. See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for a more complete reference of the API and its usage.

```python
Dataset.from_subscriptable
Dataset.from_dataframe
Dataset
    .map
    .subset
    .split
    .cache
    .with_columns

Datastream.merge
Datastream.zip
Datastream
    .map
    .data_loader
    .zip_index
    .update_weights_
    .update_example_weight_
    .weight
    .state_dict
    .load_state_dict
```

### Simple image dataset example

Here's a basic example of loading images from a directory:

```python
from datastream import Dataset
from pathlib import Path
from PIL import Image

# Assuming images are in a directory structure like:
# images/
#   class1/
#     image1.jpg
#     image2.jpg
#   class2/
#     image3.jpg
#     image4.jpg

image_dir = Path("images")
image_paths = list(image_dir.glob("*/*.jpg"))

dataset = (
    Dataset.from_paths(
        image_paths,
        pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+).jpg"
    )
    .map(lambda row: dict(
        image=Image.open(row["path"]),
        class_name=row["class_name"],
        image_name=row["image_name"],
    ))
)

# Access an item from the dataset
first_item = dataset[0]
print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")
```
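
To see what the named groups in the `pattern` argument capture, the regular expression can be tried on its own with the standard `re` module. This is only a sketch: it assumes the pattern is applied as an ordinary Python regular expression to the path string, which is an assumption about how the library uses it.

```python
import re
from pathlib import PurePosixPath

# Same pattern as in the image dataset example.
pattern = re.compile(r".*/(?P<class_name>\w+)/(?P<image_name>\w+).jpg")

path = PurePosixPath("images/class1/image1.jpg")
match = pattern.match(path.as_posix())

# groupdict() returns one entry per named group.
fields = match.groupdict() if match else {}
print(fields)  # {'class_name': 'class1', 'image_name': 'image1'}
```

Note that `\w+` matches only letters, digits, and underscores, so a file name such as `image-1.jpg` would not match this pattern.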

### Merge / stratify / oversample datastreams

Each fruit datastream below repeatedly yields the string of its fruit type.

```python
>>> datastream = Datastream.merge([
...     (apple_datastream, 2),
...     (pear_datastream, 1),
...     (banana_datastream, 1),
... ])
>>> next(iter(datastream.data_loader(batch_size=8)))
['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
```

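One way to picture the weighted merge is as a round-robin over the streams, taking as many items from each stream per round as its weight. The sketch below uses only the standard library; `merge_streams` is a hypothetical helper written for illustration and is not the library's implementation.

```python
from itertools import cycle, islice
from typing import Any, Iterable, Iterator, List, Tuple


def merge_streams(streams: List[Tuple[Iterable[Any], int]]) -> Iterator[Any]:
    """Interleave streams, taking `count` items from each per round.

    Assumes every stream is infinite, like the fruit streams here.
    """
    iterators = [(iter(stream), count) for stream, count in streams]
    while True:
        for iterator, count in iterators:
            yield from islice(iterator, count)


# Stand-ins for the fruit datastreams: each repeats its fruit name forever.
apple, pear, banana = cycle(["apple"]), cycle(["pear"]), cycle(["banana"])

merged = merge_streams([(apple, 2), (pear, 1), (banana, 1)])
batch = list(islice(merged, 8))
print(batch)
# ['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
```

With weights 2, 1, and 1, half of every batch comes from the apple stream, which is how a rare class can be oversampled regardless of how many examples it has.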
### Zip independently sampled datastreams

Each fruit datastream below repeatedly yields the string of its fruit type.

```python
>>> datastream = Datastream.zip([
...     apple_datastream,
...     Datastream.merge([pear_datastream, banana_datastream]),
... ])
>>> next(iter(datastream.data_loader(batch_size=4)))
[('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
```

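Zipping can be pictured the same way: each stream is advanced independently and one item from each is paired up per step. The sketch below is plain Python for illustration; `alternate` is a hypothetical helper standing in for an unweighted merge, and none of this is the library's implementation.

```python
from itertools import cycle, islice
from typing import Any, Iterable, Iterator


def alternate(*streams: Iterable[Any]) -> Iterator[Any]:
    """Take one item from each stream in turn (assumes infinite streams)."""
    iterators = [iter(stream) for stream in streams]
    while True:
        for iterator in iterators:
            yield next(iterator)


# Stand-ins for the fruit datastreams: each repeats its fruit name forever.
apple, pear, banana = cycle(["apple"]), cycle(["pear"]), cycle(["banana"])

# Pair every apple with the next item of the merged pear/banana stream.
zipped = zip(apple, alternate(pear, banana))
batch = list(islice(zipped, 4))
print(batch)
# [('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]
```

Because each side keeps its own position, the two streams can differ in length or sampling strategy without affecting each other.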
### More usage examples

See the [documentation](https://pytorch-datastream.readthedocs.io/en/latest/) for more usage examples.