Commit e8025ab

doc: improve readability

1 parent 683cb5c commit e8025ab

5 files changed, +852 -369 lines changed

datastream/datastream.py (+8 -4)
````diff
@@ -106,10 +106,13 @@ def merge(
     @staticmethod
     def zip(datastreams: List[Datastream]) -> Datastream[Tuple]:
         """
-        Zip multiple datastreams together so that all combinations of examples
-        are possible (i.e. the product) creating tuples like
-        ``(example1, example2, ...)``. The samples are drawn independently
-        from each underlying datastream.
+        Zip multiple datastreams together so that samples are drawn
+        independently from each underlying datastream, creating tuples
+        like ``(example1, example2, ...)``.
+
+        Note: This is different from ``Dataset.combine``, which creates
+        all possible combinations (cartesian product) of examples. If you
+        need all possible combinations, use ``Dataset.combine`` instead.
         """
         return Datastream(
             Dataset.combine([datastream.dataset for datastream in datastreams]),
````
docs/dataset.md (+225 -19)
````diff
@@ -2,6 +2,8 @@
 
 A `Dataset[T]` is a mapping that allows pipelining of functions in a readable syntax returning an example of type `T`.
 
+<!--pytest-codeblocks:importorskip(datastream)-->
+
 ```python
 from datastream import Dataset
 
````
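The intro example is cut off at the import above; the `assert dataset[2] == ('banana', 28)` visible in the next hunk's header suggests a pipeline along these lines (the fruit data is an assumption):

```python
from datastream import Dataset

fruits_and_cost = (('apple', 5), ('pear', 7), ('banana', 14), ('kiwi', 100))

dataset = (
    Dataset.from_subscriptable(fruits_and_cost)
    .starmap(lambda fruit, cost: (fruit, cost * 2))
)

assert dataset[2] == ('banana', 28)
```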
````diff
@@ -25,15 +27,49 @@ assert dataset[2] == ('banana', 28)
 
 ## Class Methods
 
-### from_subscriptable
+### `from_subscriptable`
+
+```python
+from_subscriptable(data: Subscriptable[T]) -> Dataset[T]
+```
 
 Create `Dataset` based on subscriptable i.e. implements `__getitem__` and `__len__`.
 
+#### Parameters
+
+- `data`: Any object that implements `__getitem__` and `__len__`
+
+#### Returns
+
+- A new Dataset instance
+
+#### Notes
+
 Should only be used for simple examples as a `Dataset` created with this method does not support methods that require a source dataframe like `Dataset.split` and `Dataset.subset`.
 
-### from_dataframe
+### `from_dataframe`
+
+```python
+from_dataframe(df: pd.DataFrame) -> Dataset[pd.Series]
+```
+
+Create `Dataset` based on `pandas.DataFrame`.
+
+#### Parameters
 
-Create `Dataset` based on `pandas.DataFrame`. `Dataset.__getitem__` will return a row from the dataframe and `Dataset.map` should be given a function that takes a row from the dataframe as input.
+- `df`: Source pandas DataFrame
+
+#### Returns
+
+- A new Dataset instance where `__getitem__` returns a row from the dataframe
+
+#### Notes
+
+`Dataset.map` should be given a function that takes a row from the dataframe as input.
+
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
 
 ```python
 import pandas as pd
````
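`from_subscriptable` gains parameter docs but no example in this commit; a minimal sketch of the documented behavior, with arbitrary values:

```python
from datastream import Dataset

# any object implementing __getitem__ and __len__ works, e.g. a list
dataset = Dataset.from_subscriptable([4, 5, 6])

assert len(dataset) == 3
assert dataset[0] == 4
```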
````diff
@@ -49,10 +85,30 @@ dataset = (
 assert dataset[-1] == 4
 ```
 
-### from_paths
+### `from_paths`
+
+```python
+from_paths(paths: List[str], pattern: str) -> Dataset[pd.Series]
+```
 
 Create `Dataset` from paths using regex pattern that extracts information from the path itself.
-`Dataset.__getitem__` will return a row from the dataframe and `Dataset.map` should be given a function that takes a row from the dataframe as input.
+
+#### Parameters
+
+- `paths`: List of file paths
+- `pattern`: Regex pattern with named groups to extract information from paths
+
+#### Returns
+
+- A new Dataset instance where `__getitem__` returns a row from the generated dataframe
+
+#### Notes
+
+`Dataset.map` should be given a function that takes a row from the dataframe as input.
+
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
 
 ```python
 from datastream import Dataset
````
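The `from_paths` example is truncated at the import above; a sketch of how the named-group pattern might be used — the paths, the pattern, and the exact matching semantics are assumptions, not the library's documented example:

```python
from datastream import Dataset

# hypothetical paths; the named group `class_name` becomes a column
# in the dataframe that from_paths generates
dataset = (
    Dataset.from_paths(
        ['intact/image1.png', 'damage/image2.png'],
        pattern=r'(?P<class_name>\w+)/image\d+\.png',
    )
    .map(lambda row: row['class_name'])
)

assert dataset[-1] == 'damage'
```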
````diff
@@ -68,10 +124,26 @@ assert dataset[-1] == 'damage'
 
 ## Instance Methods
 
-### map
+### `map`
+
+```python
+map(self, function: Callable[[T], U]) -> Dataset[U]
+```
 
 Creates a new dataset with the function added to the dataset pipeline.
 
+#### Parameters
+
+- `function`: Function to apply to each example
+
+#### Returns
+
+- A new Dataset with the mapping function added to the pipeline
+
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
+
 ```python
 from datastream import Dataset
 
````
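The body of the `map` example falls between this hunk and the next, whose context shows `assert dataset[-1] == 4`; a sketch consistent with that assert:

```python
from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: number + 1)
)

assert dataset[-1] == 4
```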
````diff
@@ -83,11 +155,30 @@ dataset = (
 assert dataset[-1] == 4
 ```
 
-### starmap
+### `starmap`
+
+```python
+starmap(self, function: Callable[..., U]) -> Dataset[U]
+```
 
 Creates a new dataset with the function added to the dataset pipeline.
+
+#### Parameters
+
+- `function`: Function that accepts multiple arguments unpacked from the pipeline output
+
+#### Returns
+
+- A new Dataset with the mapping function added to the pipeline
+
+#### Notes
+
 The dataset's pipeline should return an iterable that will be expanded as arguments to the mapped function.
 
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
+
 ```python
 from datastream import Dataset
 
````
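Likewise for `starmap`: the next hunk's context asserts `dataset[-1] == 7`, which a pipeline like the following would satisfy (the exact lambdas are assumptions):

```python
from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: (number, number + 1))  # pipeline now yields pairs
    .starmap(lambda number, successor: number + successor)
)

assert dataset[-1] == 7  # 3 + 4
```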
````diff
@@ -100,11 +191,29 @@ dataset = (
 assert dataset[-1] == 7
 ```
 
-### subset
+### `subset`
+
+```python
+subset(self, function: Callable[[pd.DataFrame], pd.Series]) -> Dataset[T]
+```
+
+Select a subset of the dataset using a function that receives the source dataframe as input.
 
-Select a subset of the dataset using a function that receives the source dataframe as input and is expected to return a boolean mask.
+#### Parameters
 
-Note that this function can still be called after multiple operations such as mapping functions as it uses the source dataframe.
+- `function`: Function that takes a DataFrame and returns a boolean mask
+
+#### Returns
+
+- A new Dataset containing only the selected examples
+
+#### Notes
+
+This function can still be called after multiple operations such as mapping functions as it uses the source dataframe.
+
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
 
 ```python
 import pandas as pd
````
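The `subset` example is also truncated; a sketch consistent with the `assert dataset[-1] == 2` in the next hunk's context:

```python
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(number=[1, 2, 3]))

dataset = (
    Dataset.from_dataframe(df)
    .map(lambda row: row['number'])
    .subset(lambda df: df['number'] <= 2)  # boolean mask over source rows
)

assert dataset[-1] == 2
```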
````diff
@@ -121,9 +230,36 @@ dataset = (
 assert dataset[-1] == 2
 ```
 
-### split
+### `split`
+
+```python
+split(
+    self,
+    key_column: str,
+    proportions: Dict[str, float],
+    stratify_column: Optional[str] = None,
+    filepath: Optional[str] = None,
+    seed: Optional[int] = None,
+) -> Dict[str, Dataset[T]]
+```
+
+Split dataset into multiple parts.
+
+#### Parameters
+
+- `key_column`: Column to use as unique identifier for examples
+- `proportions`: Dictionary mapping split names to proportions
+- `stratify_column`: Optional column to use for stratification
+- `filepath`: Optional path to save/load split configuration
+- `seed`: Optional random seed for reproducibility
+
+#### Returns
+
+- Dictionary mapping split names to Dataset instances
 
-Split dataset into multiple parts. Optionally you can stratify on a column in the source dataframe or save the split to a json file.
+#### Notes
+
+Optionally you can stratify on a column in the source dataframe or save the split to a json file.
 If you are sure that the split strategy will not change then you can safely use a seed instead of a filepath.
 
 Saved splits can continue from the old split and handle:
````
````diff
@@ -133,6 +269,10 @@ Saved splits can continue from the old split and handle:
 - Adapt after removing examples from dataset
 - Adapt to new stratification
 
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
+
 ```python
 import numpy as np
 import pandas as pd
````
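The `split` example continues past this hunk; judging by the asserts in the next hunk's context, a sketch with assumed dataframe contents and seed:

```python
import numpy as np
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(index=np.arange(100)))

split_datasets = (
    Dataset.from_dataframe(df)
    .map(lambda row: row['index'])
    .split(
        key_column='index',
        proportions=dict(train=0.8, test=0.2),
        seed=700,  # hypothetical seed; pass filepath=... to persist the split
    )
)

assert len(split_datasets['train']) == 80
```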
````diff
@@ -154,9 +294,21 @@ assert len(split_datasets['train']) == 80
 assert split_datasets['test'][0] == 3
 ```
 
-### zip_index
+### `zip_index`
+
+```python
+zip_index(self) -> Dataset[Tuple[T, int]]
+```
+
+Zip the output with its underlying Dataset index.
 
-Zip the output with its underlying Dataset index. The output of the pipeline will be a tuple `(output, index)`.
+#### Returns
+
+- A new Dataset where each example is a tuple of `(output, index)`
+
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
 
 ```python
 from datastream import Dataset
````
````diff
@@ -165,10 +317,26 @@ dataset = Dataset.from_subscriptable([4, 5, 6]).zip_index()
 assert dataset[0] == (4, 0)
 ```
 
-### cache
+### `cache`
+
+```python
+cache(self, key_column: str) -> Dataset[T]
+```
 
 Cache intermediate step in-memory based on key column.
 
+#### Parameters
+
+- `key_column`: Column to use as cache key
+
+#### Returns
+
+- A new Dataset with caching enabled
+
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
+
 ```python
 import pandas as pd
 from datastream import Dataset
````
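The `cache` example's dataframe is not shown; a sketch with an assumed `df` that matches the `assert dataset[0]['value'] == 1` visible in the next hunk's header:

```python
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(key=['a', 'b'], value=[1, 2]))

# repeated reads of the same example are now served from the
# in-memory cache, keyed on the `key` column
dataset = Dataset.from_dataframe(df).cache('key')

assert dataset[0]['value'] == 1
```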
````diff
@@ -178,12 +346,30 @@ dataset = Dataset.from_dataframe(df).cache('key')
 assert dataset[0]['value'] == 1
 ```
 
-### concat
+### `concat`
+
+```python
+concat(datasets: List[Dataset[T]]) -> Dataset[T]
+```
+
+Concatenate multiple datasets together.
+
+#### Parameters
+
+- `datasets`: List of datasets to concatenate
+
+#### Returns
+
+- A new Dataset combining all input datasets
 
-Concatenate multiple datasets together so that they behave like a single dataset.
+#### Notes
 
 Consider using `Datastream.merge` if you have multiple data sources instead as it allows you to control the number of samples from each source in the training batches.
 
+#### Examples
+
+<!--pytest-codeblocks:importorskip(datastream)-->
+
 ```python
 from datastream import Dataset
 
````
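The `concat` example body sits between hunks; the next hunk's context asserts `len(combined) == 4` and `combined[2] == 3`, consistent with this sketch:

```python
from datastream import Dataset

first = Dataset.from_subscriptable([1, 2])
second = Dataset.from_subscriptable([3, 4])

combined = Dataset.concat([first, second])

assert len(combined) == 4
assert combined[2] == 3  # indexing continues into the second dataset
```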
````diff
@@ -194,9 +380,29 @@ assert len(combined) == 4
 assert combined[2] == 3
 ```
 
-### combine
+### `combine`
+
+```python
+combine(datasets: List[Dataset]) -> Dataset[Tuple]
+```
+
+Zip multiple datasets together so that all combinations of examples are possible.
+
+#### Parameters
+
+- `datasets`: List of datasets to combine
+
+#### Returns
+
+- A new Dataset yielding tuples of all possible combinations
+
+#### Notes
+
+Creates tuples like `(example1, example2, ...)` for all possible combinations (i.e. the cartesian product).
+
+#### Examples
 
-Zip multiple datasets together so that all combinations of examples are possible (i.e. the product) creating tuples like `(example1, example2, ...)`.
+<!--pytest-codeblocks:importorskip(datastream)-->
 
 ```python
 from datastream import Dataset
````
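Finally, a minimal sketch of `combine`'s documented cartesian-product behavior; the tuple ordering is not specified in the docs, so only the length is asserted:

```python
from datastream import Dataset

letters = Dataset.from_subscriptable(['a', 'b'])
numbers = Dataset.from_subscriptable([1, 2, 3])

combined = Dataset.combine([letters, numbers])

# every letter paired with every number
assert len(combined) == 2 * 3
```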
