docs/source/repository_structure.mdx: 26 additions & 30 deletions
@@ -24,17 +24,15 @@ In this simple case, you'll get a dataset with two splits: `train` (containing e
 
 ## Splits
 
-If you have multiple files and want to define which file goes into which split, you can use the YAML `configs` field at the top of your README.md using glob patterns.
+If you have multiple files and want to define which file goes into which split, you can use the YAML `configs` field at the top of your README.md.
 
 For example, given a repository like this one:
 
 ```
 my_dataset_repository/
 ├── README.md
-├── directory1/
-│   └── bees.csv
-└── directory2/
-    └── more_bees.csv
+├── data.csv
+└── holdout.csv
 ```
 
 You can define your splits by adding the `configs` field in the YAML block at the top of your README.md:
@@ -45,27 +43,23 @@ configs:
 - config_name: default
   data_files:
   - split: train
-    pattern: "directory1/*.csv"
+    path: "data.csv"
   - split: test
-    pattern: "directory2/*.csv"
+    path: "holdout.csv"
 ---
 ```
 
-<Tip warning={true}>
-Note that `config_name` field is required even if you have a single configuration.
-</Tip>
 
-Having several patterns per split is also supported:
+You can select multiple files per split using a list of paths:
 
 ```
 my_dataset_repository/
 ├── README.md
-├── directory1/
-│   └── bees.csv
-├── directory1bis/
-│   └── more_bees.csv
-└── directory2/
-    └── even_more_bees.csv
+├── data/
+│   ├── abc.csv
+│   └── def.csv
+└── holdout/
+    └── ghi.csv
 ```
 
 ```yaml
@@ -74,32 +68,34 @@ configs:
 - config_name: default
   data_files:
   - split: train
-    pattern:
-    - "directory1/*.csv"
-    - "directory1bis/*.csv"
+    path:
+    - "data/abc.csv"
+    - "data/def.csv"
   - split: test
-    pattern:
-    - "directory2/*.csv"
+    path: "holdout/ghi.csv"
 ---
 ```
 
-Custom split names are also supported:
+Or you can use glob patterns to automatically list all the files you need:
+
 ```yaml
+---
 configs:
 - config_name: default
   data_files:
-  - split: random
-    pattern:
-    - "directory1bis/*.csv"
   - split: train
-    pattern:
-    - "directory1/*.csv"
+    path: "data/*.csv"
   - split: test
-    pattern:
-    - "directory2/*.csv"
+    path: "holdout/*.csv"
 ---
 ```
 
+<Tip warning={true}>
+
+Note that `config_name` field is required even if you have a single configuration.
+
+</Tip>
+
 ## Configurations
 
 Your dataset might have several subsets of data that you want to be able to load separately. In that case you can define a list of configurations inside the `configs` field in YAML:
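
As a rough illustration of the `path` entries introduced in the Splits hunks of this diff, the following Python sketch shows one way explicit paths and glob patterns could be expanded into per-split file lists. It is not the `datasets` library's implementation: `resolve_splits`, `REPO_FILES`, and `DATA_FILES` are hypothetical names, the file listing is hard-coded from the example repository, and the standard-library `fnmatch` module is an assumed stand-in matcher (its `*` also matches `/`, which real glob resolution may treat differently).

```python
from fnmatch import fnmatchcase

# Hypothetical file listing, copied from the example repository tree in the diff.
REPO_FILES = ["README.md", "data/abc.csv", "data/def.csv", "holdout/ghi.csv"]

# The `data_files` entries from the glob-pattern YAML example, written as
# already-parsed Python data (no YAML parser is used in this sketch).
DATA_FILES = [
    {"split": "train", "path": "data/*.csv"},
    {"split": "test", "path": "holdout/*.csv"},
]


def resolve_splits(data_files, repo_files):
    """Map each split name to the sorted repo files matching its path entry.

    A `path` value may be a single string or a list of strings; each string
    is treated as a glob pattern (a plain path matches only itself).
    """
    resolved = {}
    for entry in data_files:
        patterns = entry["path"]
        if isinstance(patterns, str):
            patterns = [patterns]
        resolved[entry["split"]] = sorted(
            name
            for name in repo_files
            if any(fnmatchcase(name, pattern) for pattern in patterns)
        )
    return resolved


splits = resolve_splits(DATA_FILES, REPO_FILES)
print(splits)
```

Passing a list such as `["data/abc.csv", "data/def.csv"]` as the `path` value goes through the same code path, which mirrors the "list of paths" form shown in the diff.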