Commit 638e54d: init project
nooodles2023 committed Jun 3, 2024 (1 parent e34bbd9)
Showing 7 changed files with 340 additions and 1 deletion.
108 changes: 108 additions & 0 deletions README-CN.md
Simplified Chinese | [English](README.md)

# Introduction

Tensorlink is a distributed computing framework built on CUDA API forwarding. When your computer has no GPU, or its GPU is not powerful enough, Tensorlink lets you easily use GPU resources anywhere on the local network.

# Examples
Note: the system shown in these scenarios has no physical GPU; it uses Tensorlink to connect to a 4090 GPU on another subnet.

## Scenario 1: Stable Diffusion accelerated by a remote GPU
The clip shows Stable Diffusion running its computation through Tensorlink.
![alt text](assets/3.gif)
## Scenario 2: Transformer LLM inference on a remote GPU
The clip shows the transformers framework running model inference through Tensorlink.
![alt text](assets/4.gif)
## Scenario 3: Cinema4D Octane plugin rendering on a remote GPU

# Roadmap

- CUDA Runtime API Hook ✅
- CUDA Driver API Hook ✅
- CUDA cuBLAS Hook ✅
- CUDA cuDNN Hook ✅
- Client multi-process support ✅
- ZSTD data compression support ✅
- Light (TCP+UDP) and Native (TCP+TCP) communication protocols ✅
- Multi-GPU computing on a single server
- Cluster mode, pooling GPU resources from multiple machines

# Dependencies

## Windows Client

1. Python 3.10 is recommended:
```
https://www.python.org/ftp/python/3.10.11/python-3.10.11-amd64.exe
```
2. Install PyTorch 2.1.2 built against the dynamically linked CUDA runtime:
```
https://github.com/nvwacloud/tensorlink/releases/download/deps/torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
pip install torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
```
3. Install the Tensorlink CUDA dependency libraries; if CUDA 12.1 is already installed on your system, you can skip this step.
```
https://github.com/nvwacloud/tensorlink/releases/download/deps/tensorlink_cuda_deps.zip
```
Extract the archive to any directory, then add that directory to the system Path environment variable.
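For example, from a command prompt (the path below is a placeholder for whatever directory you extracted to):
```bat
:: Append the extraction directory to the user Path (placeholder path)
setx Path "%Path%;C:\tensorlink_cuda_deps"
```
Alternatively, add the directory to Path through System Properties → Environment Variables.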

## Linux Server

Recommended systems: Rocky Linux 9.3 or Ubuntu 24.04

1. Install CUDA 12.1
https://developer.nvidia.com/cuda-12-1-0-download-archive

2. Install cuDNN 8.8.1
https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz/

3. Install ZSTD 1.5.5 or later
```shell
wget https://github.com/facebook/zstd/releases/download/v1.5.6/zstd-1.5.6.tar.gz
tar -xf zstd-1.5.6.tar.gz
cd zstd-1.5.6
make && make install
```

# Installation
Download the latest Tensorlink release:
```
https://github.com/nvwacloud/tensorlink/releases/
```

## Windows Client
After extracting Tensorlink, copy all DLL files from the client\windows directory into the System32 directory:
```bat
cd client\windows
copy *.dll C:\Windows\System32
```
<b>Note: if any CUDA-related DLL files conflict with existing ones, back up the originals first.</b>

## Linux Server
After extracting Tensorlink, copy all files from the server\linux directory to any directory.

# Running

## Linux Server (with GPU)
```shell
./tensorlink -role server -net native -recv_port 9998 -send_port 9999
```
![alt text](assets/1.png)

## Windows Client
```bat
tensorlink.exe -role client -ip 192.168.1.2 -net native -send_port 9998 -recv_port 9999
```
Note: the server's receiving port pairs with the client's sending port, and the server's sending port pairs with the client's receiving port. The ports and the protocol must match on both ends.

## Checking the Status
Open the Python command line, import the PyTorch library, and check whether the remote GPU is listed.
![alt text](assets/2.png)
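For example, a quick check from the Python REPL (the device name and count will vary with your setup):
```python
import torch

print(torch.cuda.is_available())      # True if the remote GPU is visible
print(torch.cuda.device_count())      # number of forwarded GPUs
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"
```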

# FAQ

1. <b>The server fails to start with missing cuDNN library files</b>
<br/>Check that the cuDNN libraries are installed correctly. If they were installed from the tar archive, set the library search path (e.g. LD_LIBRARY_PATH) or copy the libraries into /lib64; otherwise the program may not find them. See the example after this list.

2. <b>The client program is unresponsive</b>
<br/>Check that the client program is installed correctly and that the vcuda main process is running; DebugView can show further diagnostic output from the vcuda process.
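For instance, assuming the cuDNN archive was extracted to /opt/cudnn (a placeholder path), either of the following makes the libraries discoverable:
```shell
# Option 1: point the dynamic loader at the cuDNN libraries for this session
export LD_LIBRARY_PATH=/opt/cudnn/lib:$LD_LIBRARY_PATH

# Option 2: copy them into the system library directory and refresh the cache
cp /opt/cudnn/lib/* /lib64/ && ldconfig
```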
123 changes: 122 additions & 1 deletion README.md
# tensorlink
English | [简体中文](README-CN.md)

# Introduction

Tensorlink is a distributed computing framework based on CUDA API forwarding. When your computer lacks a GPU, or its GPU performance is insufficient, Tensorlink lets you easily use GPU resources from anywhere on the local network.

# Examples

Note: The system shown in these scenarios has no physical GPU; it uses Tensorlink to connect to a 4090 GPU on another subnet.


## Scenario 1: Stable Diffusion Accelerated by Remote GPU

The clip shows Stable Diffusion running its computation through Tensorlink.
![alt text](assets/3.gif)


## Scenario 2: Transformer LLM Inference Using Remote GPU

The clip shows the transformers framework running model inference through Tensorlink.
![alt text](assets/4.gif)



## Scenario 3: Cinema4D Octane Plugin Remote GPU Rendering

# Roadmap

- CUDA Runtime API Hook ✅
- CUDA Driver API Hook ✅
- CUDA cuBLAS Hook ✅
- CUDA cuDNN Hook ✅
- Support for Client Multi-Process ✅
- Support for ZSTD Data Compression ✅
- Support for Light (TCP+UDP) and Native (TCP+TCP) Communication Protocols ✅
- Support for Multi-GPU Computing on a Single Server
- Support for Cluster Mode, Integrating GPU Resources from Multiple Machines

# Dependencies

## Windows Client

1. Python 3.10 is recommended:
```
https://www.python.org/ftp/python/3.10.11/python-3.10.11-amd64.exe
```

2. Install PyTorch 2.1.2 built against the dynamically linked CUDA runtime:
```
https://github.com/nvwacloud/tensorlink/releases/download/deps/torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
pip install torch-2.1.2+cu121-cp310-cp310-win_amd64.whl
```

3. Install the Tensorlink CUDA dependency libraries. If CUDA 12.1 is already installed on your system, you can skip this step.

```
https://github.com/nvwacloud/tensorlink/releases/download/deps/tensorlink_cuda_deps.zip
```
Extract the archive to any directory, then add that directory to the system Path environment variable.
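For example, from a command prompt (the path below is a placeholder for whatever directory you extracted to):
```bat
:: Append the extraction directory to the user Path (placeholder path)
setx Path "%Path%;C:\tensorlink_cuda_deps"
```
Alternatively, add the directory to Path through System Properties → Environment Variables.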

## Linux Server

Recommended systems: Rocky Linux 9.3 or Ubuntu 24.04

1. Install CUDA 12.1 <br>https://developer.nvidia.com/cuda-12-1-0-download-archive

2. Install cuDNN 8.8.1 <br>https://developer.nvidia.com/downloads/compute/cudnn/secure/8.8.1/local_installers/12.0/cudnn-linux-x86_64-8.8.1.3_cuda12-archive.tar.xz/

3. Install ZSTD 1.5.5 or later

```shell
wget https://github.com/facebook/zstd/releases/download/v1.5.6/zstd-1.5.6.tar.gz
tar -xf zstd-1.5.6.tar.gz
cd zstd-1.5.6
make && make install
```

# Installation

Download the latest version of Tensorlink:
```
https://github.com/nvwacloud/tensorlink/releases/
```

## Windows Client
After extracting, copy all DLL files from the client\windows directory into the System32 directory:
```bat
cd client\windows
copy *.dll C:\Windows\System32
```
<b>Note: If there is a conflict with existing CUDA-related DLL files, back up the originals first.</b>

## Linux Server
After extracting, copy all files from the server\linux directory to any directory.

# Running

## Linux Server (with GPU)
```shell
./tensorlink -role server -net native -recv_port 9998 -send_port 9999
```
![alt text](assets/1.png)

## Windows Client

```bat
tensorlink.exe -role client -ip 192.168.1.2 -net native -send_port 9998 -recv_port 9999
```

Note: The server's receiving port pairs with the client's sending port, and the server's sending port pairs with the client's receiving port. The ports and the protocol must match on both ends.

## Checking the Status

To check that the program is running correctly, open the Python command line, import the PyTorch library, and verify that the remote GPU is listed.
![alt text](assets/2.png)
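For example, a quick check from the Python REPL (the device name and count will vary with your setup):
```python
import torch

print(torch.cuda.is_available())      # True if the remote GPU is visible
print(torch.cuda.device_count())      # number of forwarded GPUs
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"
```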


# FAQs

1. <b>Error on server startup: missing cuDNN library files</b> <br>Check that the cuDNN libraries are installed correctly. If they were installed from the tar archive, set the library search path (e.g. LD_LIBRARY_PATH) or copy the libraries into /lib64; otherwise, the program may not find them. See the example after this list.

2. <b>Client program unresponsive</b> <br>Check that the client program is installed correctly and that the vcuda main process is running; DebugView can show further diagnostic output from the vcuda process.
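For instance, assuming the cuDNN archive was extracted to /opt/cudnn (a placeholder path), either of the following makes the libraries discoverable:
```shell
# Option 1: point the dynamic loader at the cuDNN libraries for this session
export LD_LIBRARY_PATH=/opt/cudnn/lib:$LD_LIBRARY_PATH

# Option 2: copy them into the system library directory and refresh the cache
cp /opt/cudnn/lib/* /lib64/ && ldconfig
```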
Binary file added assets/1.png
Binary file added assets/2.png
Binary file added assets/3.gif
Binary file added assets/4.gif
110 changes: 110 additions & 0 deletions main.go
/*
* Tensorlink
* Copyright (C) 2024 Andy <[email protected]>
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
)

const (
	LightProtocol  = "light"
	NativeProtocol = "native"
)

func main() {
	// Require at least one argument besides the program name.
	if len(os.Args) < 2 {
		printUsage()
		return
	}

	role := flag.String("role", "", "Role of the program: 'server' for server, 'client' for client")
	sendPort := flag.String("send_port", "", "Send port")
	recvPort := flag.String("recv_port", "", "Receive port")
	protocol := flag.String("net", "", "Protocol type: 'light' or 'native'")
	serverIP := flag.String("ip", "", "Server IP address (client mode only)")
	flag.Parse()

	switch *role {
	case "server":
		if *sendPort == "" || *recvPort == "" || *protocol == "" {
			printServerUsage()
			return
		}
		if *protocol != LightProtocol && *protocol != NativeProtocol {
			printServerUsage()
			return
		}
		startServer(*sendPort, *recvPort, *protocol)
	case "client":
		if *serverIP == "" || *sendPort == "" || *recvPort == "" || *protocol == "" {
			printClientUsage()
			return
		}
		if *protocol != LightProtocol && *protocol != NativeProtocol {
			printClientUsage()
			return
		}
		startClient(*serverIP, *sendPort, *recvPort, *protocol)
	default:
		printUsage()
	}
}

func printUsage() {
	fmt.Println("Usage: ")
	fmt.Println("  -role server -net [protocol] -recv_port [receive port] -send_port [send port]   Start as server")
	fmt.Println("  -role client -ip [server ip] -net [protocol] -send_port [send port] -recv_port [receive port]   Start as client")
}

func printServerUsage() {
	fmt.Println("Server Usage: ")
	fmt.Println("  -role server -net [protocol] -recv_port [receive port] -send_port [send port]")
	fmt.Println("Example: ")
	fmt.Println("  -role server -net native -recv_port 9998 -send_port 9999")
}

func printClientUsage() {
	fmt.Println("Client Usage: ")
	fmt.Println("  -role client -ip [server ip] -net [protocol] -send_port [send port] -recv_port [receive port]")
	fmt.Println("Example: ")
	fmt.Println("  -role client -ip 192.168.2.2 -net native -send_port 9998 -recv_port 9999")
}

// startServer launches the local vcuda backend, passing the ports and
// protocol as flags.
func startServer(sendPort, recvPort, protocol string) {
	args := []string{"-s", sendPort, "-r", recvPort, "-n", protocol}
	runVCUDACommand("./vcuda", args...)
}

// startClient launches the vcuda.exe frontend; it takes positional arguments
// in the order: server IP, protocol, send port, receive port.
func startClient(serverIP, sendPort, recvPort, protocol string) {
	args := []string{serverIP, protocol, sendPort, recvPort}
	runVCUDACommand("./vcuda.exe", args...)
}

// runVCUDACommand executes the vcuda binary with the given arguments,
// forwarding its stdout and stderr to this process.
func runVCUDACommand(path string, args ...string) {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Printf("Failed to start vcuda: %v\n", err)
	}
}
