Commit 638e54d: init project
nooodles2023 committed Jun 3, 2024 (1 parent e34bbd9)
Showing 7 changed files with 340 additions and 1 deletion.
108 changes: 108 additions & 0 deletions README-CN.md
Simplified Chinese | [English](README.md)

# Introduction

Tensorlink is a distributed computing framework built on CUDA API forwarding. When your computer has no GPU, or its GPU is not powerful enough, Tensorlink lets you easily use GPU resources anywhere on the local network.

# Examples
Note: the system shown in these scenarios has no physical GPU; it uses Tensorlink to connect to a 4090 GPU on another subnet.

## Scenario 1: Stable Diffusion accelerated by a remote GPU
The clip shows Stable Diffusion running its computation through Tensorlink.
![alt text](assets/3.gif)
## Scenario 2: Transformer LLM inference on a remote GPU
The clip shows the transformers framework running model inference through Tensorlink.
![alt text](assets/4.gif)
## Scenario 3: Cinema4D Octane plugin rendering on a remote GPU

# Roadmap

- CUDA Runtime API Hook ✅
- CUDA Driver API Hook ✅
- CUDA cuBLAS Hook ✅
- CUDA cuDNN Hook ✅
- Client multi-process support ✅
- ZSTD data compression support ✅
- Light (TCP+UDP) and Native (TCP+TCP) communication protocols ✅
- Multi-GPU computing on a single server
- Cluster mode, pooling GPU resources from multiple machines

# Dependencies

## Windows Client

1. Python 3.10 is recommended:
```
https://www.python.org/ftp/python/3.10.11/python-3.10.11-amd64.exe
```
2. Install PyTorch 2.1.2 built against the dynamically linked CUDA runtime:
```
https://github.com/nvwacloud/tensorlink/releases/download/deps/torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
pip install torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
```
3. Install the Tensorlink CUDA dependency libraries; if CUDA 12.1 is already installed on your system, you can skip this step.
```
https://github.com/nvwacloud/tensorlink/releases/download/deps/tensorlink_cuda_deps.zip
```
Extract the archive to any directory, then add that directory to the system Path environment variable.
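For example, from a command prompt (the path below is a placeholder for whatever directory you extracted to):
```bat
:: Append the extraction directory to the user Path (placeholder path)
setx Path "%Path%;C:\tensorlink_cuda_deps"
```
Alternatively, add the directory to Path through System Properties → Environment Variables.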

## Linux Server

Recommended systems: Rocky Linux 9.3 or Ubuntu 24.04

1. Install CUDA 12.1
https://developer.nvidia.com/cuda-12-1-0-download-archive

2. Install cuDNN 8.8.1
https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz/

3. Install ZSTD 1.5.5 or later
```shell
wget https://github.com/facebook/zstd/releases/download/v1.5.6/zstd-1.5.6.tar.gz
tar -xf zstd-1.5.6.tar.gz
cd zstd-1.5.6
make && make install
```

# Installation
Download the latest Tensorlink release:
```
https://github.com/nvwacloud/tensorlink/releases/
```

## Windows Client
After extracting Tensorlink, copy all DLL files from the client\windows directory into the System32 directory:
```bat
cd client\windows
copy *.dll C:\Windows\System32
```
<b>Note: if any CUDA-related DLL files conflict with existing ones, back up the originals first.</b>

## Linux Server
After extracting Tensorlink, copy all files from the server\linux directory to any directory.

# Running

## Linux Server (with GPU)
```shell
./tensorlink -role server -net native -recv_port 9998 -send_port 9999
```
![alt text](assets/1.png)

## Windows Client
```bat
tensorlink.exe -role client -ip 192.168.1.2 -net native -send_port 9998 -recv_port 9999
```
Note: the server's receiving port pairs with the client's sending port, and the server's sending port pairs with the client's receiving port. The ports and the protocol must match on both ends.

## Checking the Status
Open the Python command line, import the PyTorch library, and check whether the remote GPU is listed.
![alt text](assets/2.png)
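For example, a quick check from the Python REPL (the device name and count will vary with your setup):
```python
import torch

print(torch.cuda.is_available())      # True if the remote GPU is visible
print(torch.cuda.device_count())      # number of forwarded GPUs
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"
```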

# FAQ

1. <b>The server fails to start with missing cuDNN library files</b>
<br/>Check that the cuDNN libraries are installed correctly. If they were installed from the tar archive, set the library search path (e.g. LD_LIBRARY_PATH) or copy the libraries into /lib64; otherwise the program may not find them. See the example after this list.

2. <b>The client program is unresponsive</b>
<br/>Check that the client program is installed correctly and that the vcuda main process is running; DebugView can show further diagnostic output from the vcuda process.
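For instance, assuming the cuDNN archive was extracted to /opt/cudnn (a placeholder path), either of the following makes the libraries discoverable:
```shell
# Option 1: point the dynamic loader at the cuDNN libraries for this session
export LD_LIBRARY_PATH=/opt/cudnn/lib:$LD_LIBRARY_PATH

# Option 2: copy them into the system library directory and refresh the cache
cp /opt/cudnn/lib/* /lib64/ && ldconfig
```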
123 changes: 122 additions & 1 deletion README.md
# tensorlink
English | [简体中文](README-CN.md)

# Introduction

Tensorlink is a distributed computing framework based on CUDA API forwarding. When your computer lacks a GPU, or its GPU performance is insufficient, Tensorlink lets you easily use GPU resources from anywhere on the local network.

# Examples

Note: The system shown in these scenarios has no physical GPU; it uses Tensorlink to connect to a 4090 GPU on another subnet.


## Scenario 1: Stable Diffusion Accelerated by Remote GPU

The clip shows Stable Diffusion running its computation through Tensorlink.
![alt text](assets/3.gif)


## Scenario 2: Transformer LLM Inference Using Remote GPU

The clip shows the transformers framework running model inference through Tensorlink.
![alt text](assets/4.gif)



## Scenario 3: Cinema4D Octane Plugin Remote GPU Rendering

# Roadmap

- CUDA Runtime API Hook ✅
- CUDA Driver API Hook ✅
- CUDA cuBLAS Hook ✅
- CUDA cuDNN Hook ✅
- Support for Client Multi-Process ✅
- Support for ZSTD Data Compression ✅
- Support for Light (TCP+UDP) and Native (TCP+TCP) Communication Protocols ✅
- Support for Multi-GPU Computing on a Single Server
- Support for Cluster Mode, Integrating GPU Resources from Multiple Machines

# Dependencies

## Windows Client

1. Python 3.10 is recommended:
```
https://www.python.org/ftp/python/3.10.11/python-3.10.11-amd64.exe
```

2. Install PyTorch 2.1.2 built against the dynamically linked CUDA runtime:
```
https://github.com/nvwacloud/tensorlink/releases/download/deps/torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
pip install torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
```

3. Install the Tensorlink CUDA dependency libraries. If CUDA 12.1 is already installed on your system, you can skip this step.

```
https://github.com/nvwacloud/tensorlink/releases/download/deps/tensorlink_cuda_deps.zip
```
Extract the archive to any directory, then add that directory to the system Path environment variable.
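For example, from a command prompt (the path below is a placeholder for whatever directory you extracted to):
```bat
:: Append the extraction directory to the user Path (placeholder path)
setx Path "%Path%;C:\tensorlink_cuda_deps"
```
Alternatively, add the directory to Path through System Properties → Environment Variables.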

## Linux Server

Recommended systems: Rocky Linux 9.3 or Ubuntu 24.04

1. Install CUDA 12.1 <br>https://developer.nvidia.com/cuda-12-1-0-download-archive

2. Install cuDNN 8.8.1 <br>https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz/

3. Install ZSTD 1.5.5 or later

```shell
wget https://github.com/facebook/zstd/releases/download/v1.5.6/zstd-1.5.6.tar.gz
tar -xf zstd-1.5.6.tar.gz
cd zstd-1.5.6
make && make install
```

# Installation

Download the latest version of Tensorlink:
```
https://github.com/nvwacloud/tensorlink/releases/
```

## Windows Client
After extracting, copy all DLL files from the client\windows directory into the System32 directory:
```bat
cd client\windows
copy *.dll C:\Windows\System32
```
<b>Note: If there is a conflict with existing CUDA-related DLL files, back up the originals first.</b>

## Linux Server
After extracting, copy all files from the server\linux directory to any directory.

# Running

## Linux Server (with GPU)
```shell
./tensorlink -role server -net native -recv_port 9998 -send_port 9999
```
![alt text](assets/1.png)

## Windows Client

```bat
tensorlink.exe -role client -ip 192.168.1.2 -net native -send_port 9998 -recv_port 9999
```

Note: The server's receiving port pairs with the client's sending port, and the server's sending port pairs with the client's receiving port. The ports and the protocol must match on both ends.

## Checking the Status

To check that the program is running correctly, open the Python command line, import the PyTorch library, and verify that the remote GPU is listed.
![alt text](assets/2.png)
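For example, a quick check from the Python REPL (the device name and count will vary with your setup):
```python
import torch

print(torch.cuda.is_available())      # True if the remote GPU is visible
print(torch.cuda.device_count())      # number of forwarded GPUs
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"
```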


# FAQs

1. <b>Error on server startup: missing cuDNN library files</b> <br>Check that the cuDNN libraries are installed correctly. If they were installed from the tar archive, set the library search path (e.g. LD_LIBRARY_PATH) or copy the libraries into /lib64; otherwise, the program may not find them. See the example after this list.

2. <b>Client program unresponsive</b> <br>Check that the client program is installed correctly and that the vcuda main process is running; DebugView can show further diagnostic output from the vcuda process.
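For instance, assuming the cuDNN archive was extracted to /opt/cudnn (a placeholder path), either of the following makes the libraries discoverable:
```shell
# Option 1: point the dynamic loader at the cuDNN libraries for this session
export LD_LIBRARY_PATH=/opt/cudnn/lib:$LD_LIBRARY_PATH

# Option 2: copy them into the system library directory and refresh the cache
cp /opt/cudnn/lib/* /lib64/ && ldconfig
```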
Binary file added assets/1.png
Binary file added assets/2.png
Binary file added assets/3.gif
Binary file added assets/4.gif
110 changes: 110 additions & 0 deletions main.go
/*
* Tensorlink
* Copyright (C) 2024 Andy <[email protected]>
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
)

const (
	LightProtocol  = "light"
	NativeProtocol = "native"
)

func main() {
	// Require at least one argument besides the program name.
	if len(os.Args) < 2 {
		printUsage()
		return
	}

	role := flag.String("role", "", "Role of the program: 'server' for server, 'client' for client")
	sendPort := flag.String("send_port", "", "Send port")
	recvPort := flag.String("recv_port", "", "Receive port")
	protocol := flag.String("net", "", "Protocol type: 'light' or 'native'")
	serverIP := flag.String("ip", "", "Server IP address (client mode only)")
	flag.Parse()

	switch *role {
	case "server":
		if *sendPort == "" || *recvPort == "" || *protocol == "" {
			printServerUsage()
			return
		}
		if *protocol != LightProtocol && *protocol != NativeProtocol {
			printServerUsage()
			return
		}
		startServer(*sendPort, *recvPort, *protocol)
	case "client":
		if *serverIP == "" || *sendPort == "" || *recvPort == "" || *protocol == "" {
			printClientUsage()
			return
		}
		if *protocol != LightProtocol && *protocol != NativeProtocol {
			printClientUsage()
			return
		}
		startClient(*serverIP, *sendPort, *recvPort, *protocol)
	default:
		printUsage()
	}
}

func printUsage() {
	fmt.Println("Usage: ")
	fmt.Println("  -role server -net [protocol] -recv_port [receive port] -send_port [send port]   Start as server")
	fmt.Println("  -role client -ip [server ip] -net [protocol] -send_port [send port] -recv_port [receive port]   Start as client")
}

func printServerUsage() {
	fmt.Println("Server Usage: ")
	fmt.Println("  -role server -net [protocol] -recv_port [receive port] -send_port [send port]")
	fmt.Println("Example: ")
	fmt.Println("  -role server -net native -recv_port 9998 -send_port 9999")
}

func printClientUsage() {
	fmt.Println("Client Usage: ")
	fmt.Println("  -role client -ip [server ip] -net [protocol] -send_port [send port] -recv_port [receive port]")
	fmt.Println("Example: ")
	fmt.Println("  -role client -ip 192.168.2.2 -net native -send_port 9998 -recv_port 9999")
}

// startServer launches the local vcuda backend, passing the ports and
// protocol as flags.
func startServer(sendPort, recvPort, protocol string) {
	args := []string{"-s", sendPort, "-r", recvPort, "-n", protocol}
	runVCUDACommand("./vcuda", args...)
}

// startClient launches the vcuda.exe frontend; it takes positional arguments
// in the order: server IP, protocol, send port, receive port.
func startClient(serverIP, sendPort, recvPort, protocol string) {
	args := []string{serverIP, protocol, sendPort, recvPort}
	runVCUDACommand("./vcuda.exe", args...)
}

// runVCUDACommand executes the vcuda binary with the given arguments,
// forwarding its stdout and stderr to this process.
func runVCUDACommand(path string, args ...string) {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Printf("Failed to start vcuda: %v\n", err)
	}
}
