A `Dataset[T]` is a mapping that allows pipelining of functions in a readable syntax, returning examples of type `T`.

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

fruits_and_cost = (
    ('apple', 5),
    ('pear', 7),
    ('banana', 14),
)

dataset = (
    Dataset.from_subscriptable(fruits_and_cost)
    .starmap(lambda fruit, cost: (fruit, cost * 2))
)

assert dataset[2] == ('banana', 28)
```

## Class Methods
### `from_subscriptable`

```python
from_subscriptable(data: Subscriptable[T]) -> Dataset[T]
```

Create a `Dataset` from a subscriptable object, i.e. one that implements `__getitem__` and `__len__`.

#### Parameters

- `data`: Any object that implements `__getitem__` and `__len__`

#### Returns

- A new Dataset instance

#### Notes

Should only be used for simple examples, as a `Dataset` created with this method does not support methods that require a source dataframe, such as `Dataset.split` and `Dataset.subset`.
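Any object that implements these two methods can serve as a source. To make the protocol concrete, here is a plain-Python sketch of a minimal subscriptable (the `Squares` class is purely illustrative, not part of datastream):

```python
class Squares:
    """A minimal subscriptable: implements __getitem__ and __len__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        # Examples are computed on demand from the index.
        return index ** 2

squares = Squares(4)
assert len(squares) == 4
assert squares[3] == 9
```

An object like this could be passed directly to `Dataset.from_subscriptable`.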
### `from_dataframe`

```python
from_dataframe(df: pd.DataFrame) -> Dataset[pd.Series]
```

Create a `Dataset` based on a `pandas.DataFrame`.

#### Parameters

- `df`: Source pandas DataFrame

#### Returns

- A new Dataset instance where `__getitem__` returns a row from the dataframe

#### Notes

`Dataset.map` should be given a function that takes a row from the dataframe as input.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(number=[1, 2, 3]))

dataset = (
    Dataset.from_dataframe(df)
    .map(lambda row: row['number'] + 1)
)

assert dataset[-1] == 4
```
### `from_paths`

```python
from_paths(paths: List[str], pattern: str) -> Dataset[pd.Series]
```

Create a `Dataset` from paths using a regex pattern that extracts information from the path itself.

#### Parameters

- `paths`: List of file paths
- `pattern`: Regex pattern with named groups to extract information from paths

#### Returns

- A new Dataset instance where `__getitem__` returns a row from the generated dataframe

#### Notes

`Dataset.map` should be given a function that takes a row from the dataframe as input.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

paths = [
    'damage/1.png',
    'damage/2.png',
]

dataset = (
    Dataset.from_paths(paths, pattern=r'(?P<class_name>\w+)/\d+\.png')
    .map(lambda row: row['class_name'])
)

assert dataset[-1] == 'damage'
```

## Instance Methods
### `map`

```python
map(self, function: Callable[[T], U]) -> Dataset[U]
```

Creates a new dataset with the function added to the dataset pipeline.

#### Parameters

- `function`: Function to apply to each example

#### Returns

- A new Dataset with the mapping function added to the pipeline

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: number + 1)
)

assert dataset[-1] == 4
```
### `starmap`

```python
starmap(self, function: Callable[..., U]) -> Dataset[U]
```

Creates a new dataset with the function added to the dataset pipeline.

#### Parameters

- `function`: Function that accepts multiple arguments unpacked from the pipeline output

#### Returns

- A new Dataset with the mapping function added to the pipeline

#### Notes

The dataset's pipeline should return an iterable that will be expanded as arguments to the mapped function.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: (number, number + 1))
    .starmap(lambda number, plus_one: number + plus_one)
)

assert dataset[-1] == 7
```
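The unpacking follows Python's own `*args` argument expansion; a minimal plain-Python sketch of the idea (the `starapply` helper is hypothetical, purely for illustration):

```python
def starapply(function, example):
    # Expand the iterable example into positional arguments,
    # mirroring how a starmap-style call invokes the function.
    return function(*example)

# A pipeline step produced the pair (3, 4); unpacking yields two arguments.
assert starapply(lambda number, plus_one: number + plus_one, (3, 4)) == 7
```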
### `subset`

```python
subset(self, function: Callable[[pd.DataFrame], pd.Series]) -> Dataset[T]
```

Select a subset of the dataset using a function that receives the source dataframe as input.

#### Parameters

- `function`: Function that takes a DataFrame and returns a boolean mask

#### Returns

- A new Dataset containing only the selected examples

#### Notes

This function can still be called after multiple operations such as mapping functions, as it uses the source dataframe.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(number=[1, 2, 3]))

dataset = (
    Dataset.from_dataframe(df)
    .map(lambda row: row['number'])
    .subset(lambda df: df['number'] <= 2)
)

assert dataset[-1] == 2
```
### `split`

```python
split(
    self,
    key_column: str,
    proportions: Dict[str, float],
    stratify_column: Optional[str] = None,
    filepath: Optional[str] = None,
    seed: Optional[int] = None,
) -> Dict[str, Dataset[T]]
```

Split dataset into multiple parts.

#### Parameters

- `key_column`: Column to use as unique identifier for examples
- `proportions`: Dictionary mapping split names to proportions
- `stratify_column`: Optional column to use for stratification
- `filepath`: Optional path to save/load split configuration
- `seed`: Optional random seed for reproducibility

#### Returns

- Dictionary mapping split names to Dataset instances

#### Notes

Optionally you can stratify on a column in the source dataframe or save the split to a json file. If you are sure that the split strategy will not change, then you can safely use a seed instead of a filepath.

Saved splits can continue from the old split and handle:
- Adapt after removing examples from dataset
- Adapt to new stratification

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import numpy as np
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(index=np.arange(100)))

split_datasets = (
    Dataset.from_dataframe(df)
    .map(lambda row: row['index'])
    .split(
        key_column='index',
        proportions=dict(train=0.8, test=0.2),
        seed=700,
    )
)

assert len(split_datasets['train']) == 80
assert split_datasets['test'][0] == 3
```
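Conceptually, the split assigns each unique key to a part according to the proportions, deterministically given a seed. A simplified plain-Python illustration of that idea (not datastream's actual algorithm; `split_keys` is a hypothetical helper):

```python
import random

def split_keys(keys, proportions, seed=0):
    # Shuffle the keys deterministically, then slice them into
    # consecutive chunks sized according to the proportions.
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    splits, start = {}, 0
    for name, proportion in proportions.items():
        end = start + round(proportion * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits

splits = split_keys(range(100), dict(train=0.8, test=0.2), seed=700)
assert len(splits['train']) == 80
assert len(splits['test']) == 20
```

Because the shuffle is seeded, the same keys, proportions, and seed always yield the same assignment, which is why a seed can substitute for a saved split file when the strategy is fixed.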
### `zip_index`

```python
zip_index(self) -> Dataset[Tuple[T, int]]
```

Zip the output with its underlying Dataset index.

#### Returns

- A new Dataset where each example is a tuple of `(output, index)`

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset = Dataset.from_subscriptable([4, 5, 6]).zip_index()

assert dataset[0] == (4, 0)
```
### `cache`

```python
cache(self, key_column: str) -> Dataset[T]
```

Cache intermediate step in-memory based on key column.

#### Parameters

- `key_column`: Column to use as cache key

#### Returns

- A new Dataset with caching enabled

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(key=['a', 'b'], value=[1, 2]))

dataset = Dataset.from_dataframe(df).cache('key')

assert dataset[0]['value'] == 1
```
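The caching behaves like memoization keyed on the key column's value: the pipeline's work is done once per key and reused afterwards. A plain-Python sketch of the idea (the `KeyCache` class is illustrative, not datastream's implementation):

```python
class KeyCache:
    """Memoize computed examples by a unique key."""

    def __init__(self, compute):
        self.compute = compute
        self.store = {}
        self.calls = 0  # counts actual computations, not lookups

    def __call__(self, key):
        if key not in self.store:
            self.calls += 1
            self.store[key] = self.compute(key)
        return self.store[key]

expensive = KeyCache(lambda key: key.upper())
assert expensive('a') == 'A'
assert expensive('a') == 'A'
assert expensive.calls == 1  # the second lookup hit the cache
```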
### `concat`

```python
concat(datasets: List[Dataset[T]]) -> Dataset[T]
```

Concatenate multiple datasets together so that they behave like a single dataset.

#### Parameters

- `datasets`: List of datasets to concatenate

#### Returns

- A new Dataset combining all input datasets

#### Notes

Consider using `Datastream.merge` instead if you have multiple data sources, as it allows you to control the number of samples from each source in the training batches.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset1 = Dataset.from_subscriptable([1, 2])
dataset2 = Dataset.from_subscriptable([3, 4])

combined = Dataset.concat([dataset1, dataset2])

assert len(combined) == 4
assert combined[2] == 3
```
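Indexing into a concatenated dataset amounts to mapping a global index to a particular dataset and a local index within it. A sketch of that bookkeeping using cumulative lengths (illustrative only; `locate` is a hypothetical helper, not datastream's API):

```python
from bisect import bisect_right
from itertools import accumulate

def locate(lengths, index):
    # Map a global index to (dataset number, local index)
    # using the cumulative lengths of the concatenated datasets.
    cumulative = list(accumulate(lengths))
    dataset_index = bisect_right(cumulative, index)
    previous = cumulative[dataset_index - 1] if dataset_index > 0 else 0
    return dataset_index, index - previous

# Two datasets of length 2: global index 2 is the first example
# of the second dataset, global index 1 stays in the first.
assert locate([2, 2], 2) == (1, 0)
assert locate([2, 2], 1) == (0, 1)
```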
### `combine`

```python
combine(datasets: List[Dataset]) -> Dataset[Tuple]
```

Zip multiple datasets together so that all combinations of examples are possible.

#### Parameters

- `datasets`: List of datasets to combine

#### Returns

- A new Dataset yielding tuples of all possible combinations

#### Notes

Creates tuples like `(example1, example2, ...)` for all possible combinations (i.e. the cartesian product).

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset1 = Dataset.from_subscriptable([1, 2])
dataset2 = Dataset.from_subscriptable(['a', 'b'])

combined = Dataset.combine([dataset1, dataset2])

assert len(combined) == 4
```
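The combination semantics match the standard cartesian product; for illustration, the equivalent over plain lists with `itertools.product`:

```python
from itertools import product

a = [1, 2]
b = ['x', 'y', 'z']

# Every pairing of one example from each input, as tuples.
combined = list(product(a, b))

assert len(combined) == len(a) * len(b)
assert combined[0] == (1, 'x')
```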