Commit 638e54d (parent e34bbd9): 7 changed files with 340 additions and 1 deletion.
简体中文 | [English](README.md)

# Introduction

Tensorlink is a distributed computing framework based on CUDA API forwarding. When your computer has no GPU, or its GPU is not powerful enough, Tensorlink lets you easily use GPU resources located anywhere on your local network.

# Examples

Note: the system shown in these scenarios has no physical GPU; it uses Tensorlink to connect to a 4090 GPU in another subnet.

## Scenario 1: Stable Diffusion accelerated by a remote GPU
The image shows Stable Diffusion computing through Tensorlink.
![alt text](assets/3.gif)

## Scenario 2: Transformer LLM inference on a remote GPU
The image shows the transformers framework running model inference through Tensorlink.
![alt text](assets/4.gif)

## Scenario 3: Cinema 4D Octane plugin rendering on a remote GPU

# Roadmap

- CUDA Runtime API Hook ✅
- CUDA Driver API Hook ✅
- CUDA cuBLAS Hook ✅
- CUDA cuDNN Hook ✅
- Client multi-process support ✅
- ZSTD data compression ✅
- Light (TCP+UDP) and Native (TCP+TCP) communication protocols ✅
- Multi-GPU computing on a single server
- Cluster mode, pooling GPU resources from multiple machines

# Dependencies

## Windows Client

1. Python 3.10 is recommended.
```text
https://www.python.org/ftp/python/3.10.11/python-3.10.11-amd64.exe
```
2. Use the PyTorch 2.1.2 build that dynamically links the CUDA runtime libraries.
```shell
# wheel: https://github.com/nvwacloud/tensorlink/releases/download/deps/torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
pip install torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
```
3. Install the Tensorlink CUDA dependency libraries. If CUDA 12.1 is already installed on your system, you can skip this step.
```text
https://github.com/nvwacloud/tensorlink/releases/download/deps/tensorlink_cuda_deps.zip
```
Extract the archive to any directory, then add that directory to the system `Path` environment variable.

## Linux Server

Recommended systems: Rocky Linux 9.3 or Ubuntu 24.04

1. Install CUDA 12.1
   https://developer.nvidia.com/cuda-12-1-0-download-archive

2. Install cuDNN 8.8.1
   https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz/

3. Install ZSTD 1.5.5 or later
```shell
wget https://github.com/facebook/zstd/releases/download/v1.5.6/zstd-1.5.6.tar.gz
tar -xf zstd-1.5.6.tar.gz
cd zstd-1.5.6
make && make install
```

# Installation

Download the latest release of Tensorlink:
```text
https://github.com/nvwacloud/tensorlink/releases/
```

## Windows Client
After extracting Tensorlink, copy all DLL files from the client\windows directory into System32:
```shell
cd client\windows
copy *.dll C:\Windows\System32
```
<b>Note: if any existing CUDA-related DLL files conflict, back up the originals first.</b>

## Linux Server
After extracting Tensorlink, copy all files from the server\linux directory to any directory.

# Running

## Linux Server (with GPU)
```shell
./tensorlink -role server -net native -recv_port 9998 -send_port 9999
```
![alt text](assets/1.png)

## Windows Client
```shell
tensorlink.exe -role client -ip 192.168.1.2 -net native -send_port 9998 -recv_port 9999
```
Note: the server's receive port corresponds to the client's send port, and the server's send port corresponds to the client's receive port. The ports and the protocol must match on both sides.

## Check the running status
From a Python prompt, import the PyTorch library and check that the remote GPU information appears.
![alt text](assets/2.png)

# FAQ

1. <b>The server fails to start with missing cuDNN library files</b>
<br/>Check that the cuDNN libraries are installed correctly. If they were installed from an archive, set the library-path environment variables or copy the libraries into the /lib64 directory; otherwise the program may not find them.

2. <b>The client program does not respond</b>
<br/>Check that the client is installed correctly and that the vcuda main process is running; DebugView can show further diagnostic output from the vcuda process.
# tensorlink
English | [简体中文](README-CN.md)

# Introduction

Tensorlink is a distributed computing framework based on CUDA API forwarding. When your computer lacks a GPU, or its GPU performance is insufficient, Tensorlink allows you to easily utilize GPU resources from any location within the local area network.
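To make the idea concrete, here is a minimal Go sketch of what "API forwarding" means: the client intercepts a CUDA call, serializes its name and arguments, and the machine with the real GPU decodes and replays it. The `APICall` type and the gob encoding are illustrative assumptions, not Tensorlink's actual wire format.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// APICall is a hypothetical envelope for one forwarded CUDA call
// (illustrative only; Tensorlink's real wire format is not shown here).
type APICall struct {
	Name string   // e.g. "cudaMalloc"
	Args []uint64 // raw scalar arguments
}

// encodeCall serializes a call for transmission to the GPU host.
func encodeCall(c APICall) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(c); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeCall reconstructs the call on the GPU host before replaying it.
func decodeCall(b []byte) (APICall, error) {
	var c APICall
	err := gob.NewDecoder(bytes.NewReader(b)).Decode(&c)
	return c, err
}

func main() {
	call := APICall{Name: "cudaMalloc", Args: []uint64{1 << 20}} // request 1 MiB on the remote GPU
	wire, err := encodeCall(call)
	if err != nil {
		panic(err)
	}
	got, err := decodeCall(wire)
	if err != nil {
		panic(err)
	}
	fmt.Printf("forwarded %s(%d bytes)\n", got.Name, got.Args[0]) // prints: forwarded cudaMalloc(1048576 bytes)
}
```

The real framework must also forward results and device memory back to the client, but the round trip above is the core of the technique.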
# Examples

Note: the system shown in these scenarios has no physical GPU; it uses Tensorlink to connect to a 4090 GPU in another subnet.

## Scenario 1: Stable Diffusion Accelerated by a Remote GPU

The image shows Stable Diffusion computing through Tensorlink.
![alt text](assets/3.gif)

## Scenario 2: Transformer LLM Inference Using a Remote GPU

The image shows the transformers framework running model inference through Tensorlink.
![alt text](assets/4.gif)

## Scenario 3: Cinema 4D Octane Plugin Rendering on a Remote GPU
# Roadmap

- CUDA Runtime API Hook ✅
- CUDA Driver API Hook ✅
- CUDA cuBLAS Hook ✅
- CUDA cuDNN Hook ✅
- Support for Client Multi-Process ✅
- Support for ZSTD Data Compression ✅
- Support for Light (TCP+UDP) and Native (TCP+TCP) Communication Protocols ✅
- Support for Multi-GPU Computing on a Single Server
- Support for Cluster Mode, Integrating GPU Resources from Multiple Machines
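The two `-net` modes can be read as a choice of transport pairing. The helper below is a hypothetical sketch, not taken from the Tensorlink source; in particular, which channel rides on UDP in Light mode is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// transports maps a -net flag value to an assumed (control, data)
// transport pair: Light = TCP+UDP, Native = TCP+TCP, per the roadmap.
// The control/data split is illustrative, not from the source.
func transports(protocol string) (control, data string, err error) {
	switch protocol {
	case "light":
		return "tcp", "udp", nil
	case "native":
		return "tcp", "tcp", nil
	default:
		return "", "", errors.New("unknown protocol: " + protocol)
	}
}

func main() {
	c, d, _ := transports("light")
	fmt.Println(c, d) // prints: tcp udp
}
```

A UDP data channel trades reliability for latency, which is presumably why both modes keep a TCP channel for traffic that cannot be dropped.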
# Dependencies

## Windows Client

1. Python 3.10 is recommended.
```text
https://www.python.org/ftp/python/3.10.11/python-3.10.11-amd64.exe
```

2. Use the PyTorch 2.1.2 build that dynamically links the CUDA runtime libraries.
```shell
# wheel: https://github.com/nvwacloud/tensorlink/releases/download/deps/torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
pip install torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
```

3. Install the Tensorlink CUDA dependency libraries. If CUDA 12.1 is already installed on your system, you can skip this step.

```text
https://github.com/nvwacloud/tensorlink/releases/download/deps/tensorlink_cuda_deps.zip
```
Extract the archive to any directory, then add that directory to the system `Path` environment variable.
## Linux Server

Recommended systems: Rocky Linux 9.3 or Ubuntu 24.04

1. Install CUDA 12.1 <br>https://developer.nvidia.com/cuda-12-1-0-download-archive

2. Install cuDNN 8.8.1 <br>https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz/

3. Install ZSTD 1.5.5 or later

```shell
wget https://github.com/facebook/zstd/releases/download/v1.5.6/zstd-1.5.6.tar.gz
tar -xf zstd-1.5.6.tar.gz
cd zstd-1.5.6
make && make install
```
# Installation

Download the latest release of Tensorlink:
```text
https://github.com/nvwacloud/tensorlink/releases/
```

## Windows Client
After extracting, copy all DLL files from the client\windows directory to the System32 directory:
```shell
cd client\windows
copy *.dll C:\Windows\System32
```
<b>Note: if any existing CUDA-related DLL files conflict, back up the originals first.</b>

## Linux Server
After extracting, copy all files from the server\linux directory to any directory.

# Running

## Linux Server (with GPU)
```shell
./tensorlink -role server -net native -recv_port 9998 -send_port 9999
```
![alt text](assets/1.png)

## Windows Client

```shell
tensorlink.exe -role client -ip 192.168.1.2 -net native -send_port 9998 -recv_port 9999
```

Note: the server's receiving port corresponds to the client's sending port, and the server's sending port corresponds to the client's receiving port. The ports and the protocol must match on both sides.
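The mirroring rule can be captured in one predicate. This helper is illustrative, not part of the Tensorlink CLI: it checks that the server's `recv_port` equals the client's `send_port`, the server's `send_port` equals the client's `recv_port`, and both sides pass the same `-net` protocol.

```go
package main

import "fmt"

// portsPaired reports whether a server and client configuration mirror
// each other correctly (illustrative check, not in the Tensorlink CLI).
func portsPaired(serverRecv, serverSend, clientSend, clientRecv, serverNet, clientNet string) bool {
	return serverRecv == clientSend && serverSend == clientRecv && serverNet == clientNet
}

func main() {
	// matches the example commands: server -recv_port 9998 -send_port 9999,
	// client -send_port 9998 -recv_port 9999, both -net native
	fmt.Println(portsPaired("9998", "9999", "9998", "9999", "native", "native")) // prints: true
}
```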
## Check

To verify that the program is running correctly, open a Python command line, import the PyTorch library, and check that the remote GPU information appears.
![alt text](assets/2.png)

# FAQs

1. <b>Error on server startup: missing cuDNN library files</b> <br>Check that the cuDNN libraries are installed correctly. If they were installed from an archive, set the library-path environment variables or copy the libraries into the /lib64 directory; otherwise the program may not find them.

2. <b>Client program unresponsive</b> <br>Check that the client is installed correctly and that the vcuda main process is running; DebugView can show further diagnostic output from the vcuda process.
/*
 * Tensorlink
 * Copyright (C) 2024 Andy <[email protected]>
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <https://www.gnu.org/licenses/>.
 */

package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
)

const (
	LightProtocol  = "light"
	NativeProtocol = "native"
)

// main is a thin launcher: it validates the flags for server or client
// mode and then delegates to the vcuda binary that does the real work.
func main() {
	if len(os.Args) < 2 {
		printUsage()
		return
	}

	role := flag.String("role", "", "Role of the program: 'server' for server, 'client' for client")
	sendPort := flag.String("send_port", "", "Send port")
	recvPort := flag.String("recv_port", "", "Receive port")
	protocol := flag.String("net", "", "Protocol type: 'light' or 'native'")
	serverIP := flag.String("ip", "", "Server IP address (client mode only)")
	flag.Parse()

	switch *role {
	case "server":
		if *sendPort == "" || *recvPort == "" || *protocol == "" {
			printServerUsage()
			return
		}
		if *protocol != LightProtocol && *protocol != NativeProtocol {
			printServerUsage()
			return
		}
		startServer(*sendPort, *recvPort, *protocol)
	case "client":
		if *serverIP == "" || *sendPort == "" || *recvPort == "" || *protocol == "" {
			printClientUsage()
			return
		}
		if *protocol != LightProtocol && *protocol != NativeProtocol {
			printClientUsage()
			return
		}
		startClient(*serverIP, *sendPort, *recvPort, *protocol)
	default:
		printUsage()
	}
}

func printUsage() {
	fmt.Println("Usage:")
	fmt.Println("  -role server -net [protocol] -recv_port [receive port] -send_port [send port]   Start as server")
	fmt.Println("  -role client -ip [server ip] -net [protocol] -send_port [send port] -recv_port [receive port]   Start as client")
}

func printServerUsage() {
	fmt.Println("Server Usage:")
	fmt.Println("  -role server -net [protocol] -recv_port [receive port] -send_port [send port]")
	fmt.Println("Example:")
	fmt.Println("  -role server -net native -recv_port 9998 -send_port 9999")
}

func printClientUsage() {
	fmt.Println("Client Usage:")
	fmt.Println("  -role client -ip [server ip] -net [protocol] -send_port [send port] -recv_port [receive port]")
	fmt.Println("Example:")
	fmt.Println("  -role client -ip 192.168.2.2 -net native -send_port 9998 -recv_port 9999")
}

// startServer launches the vcuda backend with server-style flags.
func startServer(sendPort, recvPort, protocol string) {
	args := []string{"-s", sendPort, "-r", recvPort, "-n", protocol}
	runVCUDACommand("./vcuda", args...)
}

// startClient launches the vcuda client, pointing it at the server.
func startClient(serverIP, sendPort, recvPort, protocol string) {
	args := []string{serverIP, protocol, sendPort, recvPort}
	runVCUDACommand("./vcuda.exe", args...)
}

// runVCUDACommand runs the given vcuda binary with the supplied
// arguments, wiring its stdout/stderr through to this process.
func runVCUDACommand(path string, args ...string) {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	if err != nil {
		fmt.Printf("Failed to start vcuda: %v\n", err)
	}
}