docs/contributing/README.md

You can also reach out for support in the `#sig-spyre` channel in the [vLLM Slack](https://inviter.co/vllm-slack) workspace.

## Docs

### Building the docs with MkDocs
Install MkDocs along with the [plugins](https://github.com/vllm-project/vllm-spyre/blob/main/mkdocs.yaml) used in the vLLM Spyre documentation.
```bash
uv pip install -r docs/requirements-docs.txt
```
!!! note
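
Once the requirements are installed, you can usually preview the documentation locally with MkDocs' built-in live-reload server (a minimal sketch using the standard `mkdocs serve` command; the exact invocation for this repo may differ):

```bash
# Serve the docs at http://127.0.0.1:8000 with live reload,
# using the mkdocs.yaml at the repository root.
mkdocs serve
```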

Run the continuous batching tests:

```bash
python -m pytest -v -x tests/e2e -m cb
```

## Debugging

!!! tip
    You can `oc edit` a pod and change the image without having the pod scheduled to a different node. This can be useful for testing whether software or hardware is the issue.
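
For example, a minimal sketch (the pod name `my-vllm-pod` is a placeholder):

```bash
# Opens the live pod spec in your editor; change spec.containers[].image and save.
# The pod keeps its node assignment, so you can compare software stacks on the same hardware.
oc edit pod my-vllm-pod
```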

- The script `/opt/sentient/bin/aiu-query-devices` in the pod can be used to see the connectivity between the `AIUs` on the machine. You can also infer this from environment variables with names like `AIU_TIER_\d_SET_\d_RANK_\d` (see the combined sketch after this list).

- `SPYRE_DEVICES` can be used to select which device is used for each `RANK`. This is similar to how `CUDA_VISIBLE_DEVICES` works for GPUs.

    !!! example
        `0,2,4,6` will assign rank `0` to AIU index `0`, rank `1` to AIU index `2`, rank `2` to AIU index `4`, and rank `3` to AIU index `6`.

- An alternative is to use `AIU_WORLD_RANK_\d=0000:aa:00.0` to explicitly map ranks to `PCI` addresses (make sure there are no duplicates used at runtime).

- A bash script that uses `/opt/sentient/senlib/bin/senlib_unit_test` can check whether each `AIU` allocated to the pod passes a basic test:
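
    The script itself is not reproduced here; below is a hypothetical sketch that loops over the `AIU_WORLD_RANK_*` variables and runs the unit test against each card, assuming `senlib_unit_test` selects its device from `AIU_WORLD_RANK_0` (the exact invocation may differ):

    ```bash
    #!/usr/bin/env bash
    # Hypothetical sketch: probe each AIU allocated to the pod with senlib_unit_test.
    # Assumes AIU_WORLD_RANK_<n> holds the PCI address of each allocated card and that
    # senlib_unit_test honors AIU_WORLD_RANK_0 for device selection.
    set -u
    for var in $(env | grep -oE '^AIU_WORLD_RANK_[0-9]+' | sort); do
        addr="${!var}"
        echo "=== Testing ${var} (${addr}) ==="
        if AIU_WORLD_RANK_0="${addr}" /opt/sentient/senlib/bin/senlib_unit_test; then
            echo "PASS ${addr}"
        else
            echo "FAIL ${addr}"
        fi
    done
    ```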

`DTLOG_LEVEL=INFO` (piped to a file) can help you see which device addresses are actually in use. Look for the string `Opened: SEN:VFIO`.
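
Putting these pieces together, a minimal sketch (the device indices, PCI address, and workload command are illustrative; adapt them to what `aiu-query-devices` reports in your pod):

```bash
# Inspect the topology hints exposed to the pod.
/opt/sentient/bin/aiu-query-devices
env | grep -E '^AIU_TIER_[0-9]+_SET_[0-9]+_RANK_[0-9]+' | sort

# Pin ranks to AIU indices (rank 0 -> index 0, rank 1 -> index 2, ...).
export SPYRE_DEVICES=0,2,4,6

# Or map a rank directly to a PCI address (no duplicate addresses at runtime).
export AIU_WORLD_RANK_0=0000:aa:00.0

# Run the workload with verbose device logs piped to a file, then check
# which device addresses were actually opened.
DTLOG_LEVEL=INFO <your vllm-spyre command> > dtlog.txt 2>&1
grep 'Opened: SEN:VFIO' dtlog.txt
```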

!!! tip
    To stop massive log spew, this configuration is ideal:

    ```bash
    export DTLOG_LEVEL=ERROR
    export TORCH_SENDNN_LOG=CRITICAL
    ```

### Topology Aware Allocation

This section is specific to the AIU operator and scheduling workloads onto specific cards.

(TODO: link to docs once they exist)

- This mode lets users request a specific set of AIU cards based on `PCI` topology. Using it guarantees that AIU cards of a particular class are picked up on the node:

    - `Tier0` provides a set of cards on the same `PCI` switch.
    - `Tier1` provides a set of cards from `PCI` switches at most one hop away.
    - `Tier2` provides a set of cards from `PCI` switches at most two hops away.

- Running a multi-AIU job using `ibm.com/aiu_pf_tier0`, `tier1`, or `tier2`:

    - This resource type is used to pick up a topology-aware card set, which is required to run tensor parallel (`TP`) workloads more effectively. By using a `tierX` class resource, `TP` users automatically get a best-performing card set for the workload.

    - The maximum number of allocatable resources in each tier depends on the platform and cluster, but you can get up to:

        - `Tier0`: `4` cards
        - `Tier1`: `8` cards
        - `Tier2`: `16` cards

- Devices in `tier0` can do peer-to-peer (`P2P`) `RDMA`; devices on different trees use `Host DMA`, sharing files through `/dev/shm`.
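
For example, a hypothetical pod spec fragment requesting a `tier0` card set for a 4-way `TP` workload (the container name and image are placeholders; only the `ibm.com/aiu_pf_tier0` resource name comes from the text above):

```yaml
# Hypothetical fragment: request four cards that sit on the same PCI switch.
spec:
  containers:
    - name: vllm-spyre               # placeholder
      image: <your-vllm-spyre-image> # placeholder
      resources:
        limits:
          ibm.com/aiu_pf_tier0: 4
```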

!!! warning
    If you request more cards than the switch supports, the pod will never be scheduled. In the above example, if you specify `ibm.com/aiu_pf_tier0: 5` in your yaml, the pod will never be scheduled because the maximum set of cards in `tier0` is `4`.