-### Probe: A Performance and Stability Diagnostic Tool for AI Applications
+### Probing: A Performance and Stability Diagnostic Tool for AI Applications
 
-Probe is a performance and stability diagnostic tool designed specifically for AI applications. It aims to solve the debugging and optimization challenges of large-scale, distributed, long-duration heterogeneous computing tasks (such as LLM training and inference). By injecting a probe server into the target process, it can collect more detailed performance data or modify the execution behavior of the target process in real-time.
+Probing is a performance and stability diagnostic tool designed specifically for AI applications. It aims to solve the debugging and optimization challenges of large-scale, distributed, long-duration heterogeneous computing tasks (such as LLM training and inference). By injecting a probing server into the target process, it can collect more detailed performance data or modify the execution behavior of the target process in real-time.
 
 ## Key Features
 
-The main features of Probe include:
+The main features of Probing include:
 
 - **Debugging Capabilities**:
 - Observing the call stack, Python objects, Torch Tensors, and modules of the target process;
@@ -16,16 +16,16 @@ The main features of Probe include:
 - Providing HTTP interfaces to retrieve data and control the execution of the target process;
 - Supporting remote injection of arbitrary Python code into the target process.
 
-Compared to other debugging and diagnostic tools, `probe` is plug-and-play, allowing it to intrude into the target process at any time without interruption or restart, and without modifying the code.
+Compared to other debugging and diagnostic tools, `probing` is plug-and-play, allowing it to intrude into the target process at any time without interruption or restart, and without modifying the code.
 
 ## Quick Start
 
-### Injecting the Probe
+### Injecting the Probing
 
-Use the following command to inject the probe:
+Use the following command to inject the probing:
 
 ```shell
-probe <pid> inject [OPTIONS]
+probing <pid> inject [OPTIONS]
 ```
 
 Options:
@@ -37,47 +37,47 @@ Options:
 
 ### Diagnosing Issues
 
-After injecting the probe, you can use the commands provided by probe to diagnose issues:
+After injecting the probing, you can use the commands provided by probing to diagnose issues:
 
 - `dump`: Print the current call stack to locate process blockages and deadlocks:
 
 ```shell
-probe <pid> dump
+probing <pid> dump
 ```
 
 - `pause`: Pause the process and start a remote debugging service:
 
 ```shell
-probe <pid> pause [ADDRESS] # ADDRESS is optional, default is a random port
+probing <pid> pause [ADDRESS] # ADDRESS is optional, default is a random port
 nc 127.0.0.1 3344 # Use nc to connect to the debugging service
 ```
 
 - `catch`: Take over error handling and start a remote service upon error:
 
 ```shell
-probe <pid> catch
+probing <pid> catch
 ```
 
 - `listen`: Start the background debugging service:
 
 ```shell
-probe <pid> listen [ADDRESS] # ADDRESS is optional, default is a random port
+probing <pid> listen [ADDRESS] # ADDRESS is optional, default is a random port
 nc 127.0.0.1 3344 # Use nc to connect to the debugging service
-Probe also provides a series of Python analysis and diagnostic features for the development and debugging of large models:
+Probing also provides a series of Python analysis and diagnostic features for the development and debugging of large models:
 
 - Activity Analysis: Capture the current Python stack information of each thread;
 - Debugging: Start Python remote debugging to debug the target process in VSCode;
 - Profiling: Profile the execution of torch models;
 - Inspection: Inspect Python objects, torch Tensors, and torch Modules;
 
-These features can be accessed through a web interface. For example, specify the service address when injecting the probe:
+These features can be accessed through a web interface. For example, specify the service address when injecting the probing:
 
 ```shell
-probe <pid> inject -b -a 127.0.0.1:1234
+probing <pid> inject -b -a 127.0.0.1:1234
 ```
 
 Then, you can access the above features by opening `http://127.0.0.1:1234` in a browser.
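
Editor's note, not part of the diff: the same address can also be reached from a script rather than a browser. The following is a minimal sketch, assuming only that the injected service answers plain HTTP on the address passed to `-a`. The helper names `probing_url` and `fetch` are hypothetical, and no path other than the root page is implied by the README.

```python
from urllib.request import urlopen


def probing_url(address: str, path: str = "/") -> str:
    """Turn a HOST:PORT address (as passed to `-a`) into an HTTP URL."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected HOST:PORT, got {address!r}")
    return f"http://{host}:{port}/{path.lstrip('/')}"


def fetch(address: str, path: str = "/", timeout: float = 5.0) -> bytes:
    """Fetch a page from the web interface; only the root page is named in the README."""
    with urlopen(probing_url(address, path), timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch("127.0.0.1:1234")` against a process injected with the command above would return the body of the root page, provided the service is actually listening on that address.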
 
-##Installing Probe
+##Installing Probing
 
 ### Binary Installation
 
-`probe` does not require special installation. Simply download the release file, extract it, and execute. Users can optionally add probe to the $PATH environment variable.
+`probing` does not require special installation. Simply download the release file, extract it, and execute. Users can optionally add probing to the $PATH environment variable.
 
 ### Building from Source
 
-`probe` relies on the trunk tool for building. Install it using the following command, or skip this step if it is already installed:
+`probing` relies on the trunk tool for building. Install it using the following command, or skip this step if it is already installed:
 
 ```shell
 cargo install trunk
@@ -124,16 +124,16 @@ sh build.sh
 
 ### Development Mode
 
-To facilitate development, probe packages Python scripts and the web app into libprobe.so. Repacking every time code is modified can significantly reduce efficiency, so manual building is recommended:
+To facilitate development, probing packages Python scripts and the web app into libprobing.so. Repacking every time code is modified can significantly reduce efficiency, so manual building is recommended:
 
 ```shell
 # Continuously build the web app
 cd app
 trunk watch --filehash false -d dist/
 
-# Build probe and libprobe
+# Build probing and libprobing
 cargo b -p cli
 cargo b
 ```
 
-In debug mode, probe will automatically load the web app from the dist directory and Python scripts from src, eliminating the need for repacking.
+In debug mode, probing will automatically load the web app from the dist directory and Python scripts from src, eliminating the need for repacking.