
[Feature](mluOpLogcumsumexp): add new op #1027

Open

wants to merge 77 commits into base: master

Commits (77)
b2ab7c1
LCSE
shouhoo May 21, 2024
a23619e
param_check
shouhoo May 22, 2024
74cdacd
design
shouhoo May 22, 2024
3d69a5c
logcumsumexp
shouhoo Jun 21, 2024
a25d1e1
logcumsumexp
shouhoo Jun 21, 2024
9eca45a
logcumsumexp
shouhoo Jun 21, 2024
54aac1f
logcumsumexp
shouhoo Jun 21, 2024
983525c
logcumsumexp
shouhoo Jun 21, 2024
3ff3d0a
logcumsumexp
shouhoo Jun 21, 2024
dda454e
logcumsumexp
shouhoo Jun 21, 2024
8269a05
logcumsumexp
shouhoo Jun 21, 2024
513d75e
logcumsumexp
shouhoo Jun 21, 2024
44846c8
logcumsumexp
shouhoo Jun 21, 2024
68b9f1a
logcumsumexp
shouhoo Jun 23, 2024
e889812
logcumsumexp
shouhoo Jun 23, 2024
feeb2a8
logcumsumexp
shouhoo Jun 23, 2024
8df65a3
logcumsumexp
shouhoo Jun 23, 2024
092fe6e
logcumsumexp
shouhoo Jun 23, 2024
c647dab
logcumsumexp
shouhoo Jun 23, 2024
f966453
logcumsumexp
shouhoo Jun 23, 2024
fcdd26a
logcumsumexp
shouhoo Jun 23, 2024
94562f2
logcumsumexp
shouhoo Jun 24, 2024
82bee65
logcumsumexp
shouhoo Jun 24, 2024
9c06d21
logcumsumexp
shouhoo Jun 24, 2024
5ac0066
logcumsumexp
shouhoo Jun 27, 2024
5301fb1
logcumsumexp
shouhoo Jun 28, 2024
224af3e
logcumsumexp
shouhoo Jun 28, 2024
07275c9
logcumsumexp
shouhoo Jun 28, 2024
e6e5f1e
logcumsumexp
shouhoo Jun 28, 2024
5aa8d88
logcumsumexp
shouhoo Jul 12, 2024
c3843d8
logcumsumexp
shouhoo Jul 12, 2024
6134553
logcumsumexp
shouhoo Jul 12, 2024
7128c77
logcumsumexp
shouhoo Jul 12, 2024
bc26813
logcumsumexp
shouhoo Jul 12, 2024
00f44ca
logcumsumexp
shouhoo Jul 12, 2024
89feeff
logcumsumexp
shouhoo Jul 12, 2024
ebe6de2
logcumsumexp
shouhoo Jul 12, 2024
b0c74cf
logcumsumexp
shouhoo Jul 12, 2024
ab6b421
logcumsumexp
shouhoo Jul 12, 2024
11798eb
logcumsumexp
shouhoo Jul 12, 2024
04ec410
logcumsumexp
shouhoo Jul 12, 2024
77c2e9b
logcumsumexp
shouhoo Jul 16, 2024
5d0d736
logcumsumexp
shouhoo Jul 16, 2024
c7dbdc0
logcumsumexp
shouhoo Jul 16, 2024
b3a1014
logcumsumexp
shouhoo Jul 16, 2024
44c5c9b
logcumsumexp
shouhoo Jul 19, 2024
624fc40
logcumsumexp
shouhoo Jul 19, 2024
dcf7a91
logcumsumexp
shouhoo Jul 24, 2024
f268206
logcumsumexp
shouhoo Jul 24, 2024
257fb4d
logcumsumexp
shouhoo Jul 24, 2024
9df472a
logcumsumexp
shouhoo Jul 24, 2024
9bae900
logcumsumexp
shouhoo Jul 24, 2024
3d55a82
logcumsumexp
shouhoo Jul 26, 2024
006635b
logcumsumexp
shouhoo Jul 26, 2024
12fc733
logcumsumexp
shouhoo Aug 16, 2024
a65a6d2
logcumsumexp
shouhoo Aug 16, 2024
15d8a36
logcumsumexp
shouhoo Aug 16, 2024
6757001
logcumsumexp
shouhoo Aug 16, 2024
12c8741
logcumsumexp
shouhoo Aug 16, 2024
d212033
logcumsumexp
shouhoo Aug 16, 2024
c22a0e0
logcumsumexp
shouhoo Aug 16, 2024
9e278d7
logcumsumexp
shouhoo Aug 16, 2024
d630457
logcumsumexp
shouhoo Aug 16, 2024
9c088f6
logcumsumexp
shouhoo Aug 16, 2024
590f648
logcumsumexp
shouhoo Aug 16, 2024
9a23622
logcumsumexp
shouhoo Aug 16, 2024
ce8fa88
logcumsumexp
shouhoo Aug 22, 2024
1e186aa
logcumsumexp
shouhoo Oct 21, 2024
44372f9
logcumsumexp
shouhoo Oct 21, 2024
0209e99
logcumsumexp
shouhoo Oct 21, 2024
f2b79ac
logcumsumexp
shouhoo Oct 21, 2024
eb73615
logcumsumexp
shouhoo Oct 21, 2024
1c686e5
logcumsumexp
shouhoo Oct 21, 2024
d119ed0
logcumsumexp
shouhoo Oct 21, 2024
281a61e
logcumsumexp
shouhoo Oct 21, 2024
612f2ef
logcumsumexp
shouhoo Oct 21, 2024
7ddf517
logcumsumexp
shouhoo Nov 25, 2024
227 changes: 227 additions & 0 deletions docs/design_docs/logcumsumexp/Logcumsumexp_design_doc.md
@@ -0,0 +1,227 @@
# logcumsumexp Operator Design Document

* #### Basic Information

| Operator name | logcumsumexp |
| ------------- | ------------ |
| Author / date | 徐小虎 / 2024-5-22 |

* #### Revision History

| Version | Reviser | Date | Description |
| ------- | ------- | --------- | ----------- |
| V1.0 | 徐小虎 | 2024-5-22 | Initial submission |

* #### Scope

This document describes the design of the `logcumsumexp` operator, covering requirements analysis, interface design, implementation design, and performance-optimization records.

* #### Operator Requirements Checklist

The requirement owner needs to `check` the following sections:

- 1.1 Operator requirements analysis
- 1.2 Operator functionality and application scenarios
- 1.3 Operator input/output parameter requirements
- 1.4 Operator restrictions
- 1.5 Acceptance criteria
- 2.2 Interface design
- 3.5 Test cases (the requirement owner `check`s whether the sizes given in the requirements table are listed)

## 1 Requirements Analysis

### 1.1 Operator Requirements Analysis

| Brief description of functionality | Returns the logarithm of the cumulative summed exponentials of each row of the input tensor along the given dimension |
| ------------------------------------------------------------ | ------------------------------------------ |
| Requirement source | PyTorch |
| Target networks | / |
| Input data types | input: half/float; dim: int32_t |
| Input shape | input: no restriction |
| Input layout | input: ARRAY |
| Output data type | result: same as input |
| Output shape | result: same as input |
| Output layout | result: ARRAY |
| Mode (optional) | None |
| Has a dim/axis-like parameter supporting negative values / other special handling | No (a `dim` parameter exists, but negative values are not supported) |
| Has a labels/index-like parameter supporting negative / out-of-range values / other special handling | No |
| In-place support required | No |
| Stride mechanism support required | No |
| Broadcasting support required | No |
| Return immediately on zero-element input | No |
| Other special requirements (online quantization, fusion, early layout conversion, etc., optional) | None |
| Sizes/modes prioritized in this development | None |


### 1.2 Operator Functionality and Application Scenarios

In neural networks, the `logcumsumexp` operator typically appears in computations over probabilities or probability distributions. Given a tensor and a dimension, it returns the logarithm of the cumulative summed exponentials of each row of the input along that dimension.

Example usage:

```python
>>> input = torch.tensor([[1., 2., 3., 4.],
...                       [5., 6., 7., 8.]])
>>> result = torch.logcumsumexp(input, 1)
>>> print(result)
tensor([[1.0000, 2.3133, 3.4076, 4.4402],
        [5.0000, 6.3133, 7.4076, 8.4402]])

```

### 1.3 Operator Input/Output Parameter Requirements

| Parameter | Semantics | Type (input/output) | Supported types | Physical layout | Size constraints |
| ----------- | ----------------------------------- | ----------------- | ------------- | -------- | ----------------------------- |
| handle | Handle holding the runtime context | input | mluOpHandle_t | / | / |
| input_desc | Descriptor of the input tensor | input | / | / | input has at most 8 dimensions |
| input | Input data; pointer to the MLU address of input | input | half, float | ARRAY | / |
| dim | Target dimension of the logcumsumexp operation | input | int32_t | scalar | / |
| result_desc | Descriptor of the output tensor | input | / | / | result must have the same dimensionality as input |
| result | Output data; pointer to the MLU address of result | output | half, float | ARRAY | / |

### 1.4 Operator Restrictions

| Restriction type | Details |
| ------------ | ---------------------------------------------------------- |
| In-place | In-place operation is not supported |
| Stride | The `stride` mechanism is not supported |
| Broadcasting | Broadcasting is not supported |
| Data range | ____________________________________ |
| Data type | Tensor data supports `half` and `float`; `input` and `result` must have the same type |

### 1.5 Acceptance Criteria

#### 1.5.1 Accuracy Acceptance Criteria

- Accuracy: within the supported size range (not limited to typical-size cases), passes the dynamic-threshold diff1 and diff2 accuracy checks.

#### 1.5.2 Performance Acceptance Criteria

- On typical sizes:

  Good: operator hw time is 8x that of the competitor (V100)

  Pass: operator hw time is 15x that of the competitor (V100)

- (This criterion applies when the competitor uses a single-operator implementation; if the competitor composes multiple operators, it must be explained separately.)
**shouhoo** (Collaborator, Author):

> Acknowledged.

**PetrelYy** (Collaborator) commented on Jul 2, 2024:

> Suggest changing this to the table format from the link pasted above, and adding text after the table stating that the time overhead of the 370S4 at the above sizes must stay within *** times, rather than "Good: operator hw time is 8x that of the competitor V100". What does "8x of V100" mean: above it or below it?

**shouhoo** (Collaborator, Author):

> This operator has 10 typical sizes, which with the two data types makes 20 rows in total. Would this table be better placed in the test report?


## 2 Operator Interface Design

### 2.1 Reference Interface

- PyTorch

```c++
Tensor& _logcumsumexp_out_cuda(const Tensor& self,
                               int64_t dim,
                               Tensor& result);

```

### 2.2 Interface Design

```c++
mluOpStatus_t MLUOP_WIN_API
mluOpLogcumsumexp(mluOpHandle_t handle,
                  const mluOpTensorDescriptor_t input_desc,
                  const void *input,
                  const mluOpTensorDescriptor_t result_desc,
                  void *result,
                  const int32_t dim);

```

## 3 Implementation Design

### 3.1 Implementation Scheme

`tensor.shape`:

- `input` may have any number of dimensions (at most 8 on the BANG C platform).
- `result` must have the same shape as `input`.

**Computation principle:**

Take `input = [[1., 2., 3., 4.], [5., 6., 7., 8.]]` with `dim = 1` as an example:

- Along the specified dimension, pairwise compute the logarithm of the summed exponentials, i.e. $\log(e^a + e^b)$.
- In the example, along dimension 1, we compute $\log(e^1 + e^2) = 2.3133$ and store it at the position of 2; then compute $\log(e^{2.3133} + e^3) = 3.4076$ and store it at the position of 3, and so on until the last element of the row.
- Applying this to every row yields an output tensor `result` with the same `shape` as `input`.
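The recurrence above can be sketched as a plain Python reference (an illustration only, not the kernel code); the `logaddexp` helper uses the usual max-shift trick so that large inputs do not overflow:

```python
import math

def logaddexp(a: float, b: float) -> float:
    # log(e^a + e^b), computed stably by factoring out max(a, b)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def logcumsumexp_row(row):
    # Inclusive scan of the row under the logaddexp operation
    out = []
    acc = None
    for x in row:
        acc = x if acc is None else logaddexp(acc, x)
        out.append(acc)
    return out

# Matches the PyTorch example above (dim=1 scans each row)
print([round(v, 4) for v in logcumsumexp_row([1., 2., 3., 4.])])
# → [1.0, 2.3133, 3.4076, 4.4402]
```

A naive `log(sum(exp(...)))` would overflow `float`/`half` for moderately large inputs; the max-shift form keeps every `exp` argument non-positive.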

**Implementation scheme:**

Depending on the dimensionality of the target tensor and on the target dimension, we distinguish four cases and handle them with different strategies:

- the target tensor is one-dimensional (a vector);

- the target dimension is the tensor's highest dimension;

- the target dimension is the tensor's lowest dimension;

- all other cases.

For the first and second cases, the goal is essentially an inclusive scan, as in prefix-scan algorithms.

In the first case, when the target tensor is a vector, we split the data into blocks and load them onto different cores in different clusters, exponentiate each element, and accumulate. Concretely, the block is transposed and summed row by row with vector adds; the last row then yields a compensation value for each column, which `cycle_add` broadcasts onto every row; a final transpose restores the original layout. Afterwards, compensation is applied across cores and across clusters according to how the data was partitioned, and a final log produces the result.
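The block-and-compensate idea can be sketched on the host side (a hypothetical Python illustration; the real kernel works on transposed NRAM tiles): each block is scanned independently, as each core would do, and then the running total of all preceding blocks is folded into each block as a log-space compensation term.

```python
import math

def logaddexp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def blocked_logcumsumexp(x, block):
    # 1) Scan each block independently (what each core would do).
    blocks = [x[i:i + block] for i in range(0, len(x), block)]
    scanned = []
    for b in blocks:
        acc, out = None, []
        for v in b:
            acc = v if acc is None else logaddexp(acc, v)
            out.append(acc)
        scanned.append(out)
    # 2) Compensation: fold each block's predecessor total into its elements.
    result, carry = [], None
    for out in scanned:
        for v in out:
            result.append(v if carry is None else logaddexp(carry, v))
        carry = result[-1]  # running total of everything scanned so far
    return result
```

Step 1 is fully parallel across blocks; only step 2 carries a serial dependency, and it touches one scalar per block rather than the whole data.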

In the second case, when the target dimension is the tensor's highest dimension, we view the input as a matrix whose width is the target dimension and whose height is the product of the remaining dimensions. If the target dimension is small, several rows fit into one NRAM buffer and are summed via row-wise vector adds after a transpose; if it is large, each row is treated as a batch that independently invokes the case-1 (vector) kernel.

In the third case, when the target dimension is the tensor's lowest dimension, we view the input as a matrix whose height is the target dimension and whose width is the product of the remaining dimensions. If the matrix is narrow, we simply split it into blocks of consecutive rows, sum each block independently, and then apply compensation between consecutive blocks; if the matrix is wide, we process the columns sequentially, group by group.

The fourth case can be treated as the third case with a number of batches.
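The case selection mirrors the host-side dispatch in `logcumsumexp.cpp`: the shape is collapsed into three sizes around the target dimension, and the pattern of which sizes equal 1 picks the case. A hedged Python sketch of that mapping (names follow the host code):

```python
def decompose(dims, dim):
    # Collapse the shape to three sizes around the target dimension,
    # following logcumsumexp.cpp: "lower_size" is the product of the
    # dimensions before dim, "higher_size" the product of those after.
    axis_size = dims[dim]
    lower_size = 1
    for d in dims[:dim]:
        lower_size *= d
    higher_size = 1
    for d in dims[dim + 1:]:
        higher_size *= d
    return lower_size, axis_size, higher_size

def pick_case(dims, dim):
    lower, axis, higher = decompose(dims, dim)
    if lower == 1 and higher == 1:
        return 1  # case 1: the input is a vector
    if higher == 1:
        return 2  # case 2: dim is the "highest" dimension
    if lower == 1:
        return 3  # case 3: dim is the "lowest" dimension
    return 4      # case 4: batched case 3
```

The branch order matches the host code exactly; which `cnrtFunctionType_t` and task dimension each case gets is chosen there as well.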

### 3.2 Pseudocode (Optional)

### 3.3 Work Partitioning

### (Task splitting, multi-core splitting)

_____________________________

### 3.4 Performance Optimization Design

To overlap computation time with data-transfer time, software pipelining is added to each kernel.
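As a hedged illustration of the pipelining idea (a generic double-buffered load/compute/store scheme, not the actual BANG C kernel): while tile i is being computed, tile i+1 is loaded into the other half of a ping-pong buffer.

```python
def pipelined(tiles, load, compute, store):
    # Generic double-buffered pipeline: overlap load(i+1) with compute(i).
    # On real hardware the two stages run concurrently; here they run
    # sequentially, but the buffer hand-off pattern is the same.
    buffers = [None, None]  # ping-pong buffers (NRAM halves on the device)
    out = []
    if not tiles:
        return out
    buffers[0] = load(tiles[0])      # prologue: fill the first buffer
    for i in range(len(tiles)):
        cur = buffers[i % 2]
        if i + 1 < len(tiles):       # would overlap with compute(cur)
            buffers[(i + 1) % 2] = load(tiles[i + 1])
        out.append(store(compute(cur)))
    return out

# Example: "compute" doubles each tile's values
result = pipelined([[1, 2], [3, 4]],
                   load=lambda t: list(t),
                   compute=lambda b: [v * 2 for v in b],
                   store=lambda b: b)
# result == [[2, 4], [6, 8]]
```

With roughly equal load and compute times, this hides most of the memory traffic behind computation after the prologue.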

### 3.5 Theoretical Performance

_____________________________

### 3.6 Maintainability Design

1. Variable, function, and class names follow the `MLU_OPS` naming convention, striving for code that can be understood from the names alone without reading the comments.

2. Each function implements exactly one task, and function length is kept as short as possible.

3. Reasonable defensive (fool-proof) checks.

4. Key information is printed to the log.

### 3.7 Test Case Design

_____________________________

### 3.8 Defensive Parameter Checks

The following cases trigger an error and return the error code `MLUOP_STATUS_BAD_PARAM`:

1. A pointer is null.

2. The input has zero elements.

3. Data-type check: the data type is neither half nor float.

4. Dimensionality check on `input` and `result`: the number of dimensions must not exceed 8.

5. `input` and `result` differ in dimensionality or in any dimension size.

## 4 Operator Performance Optimization Record

### 4.1 Sizes with Known Issues

_____________________________

### 4.2 Sizes Already Optimized

_____________________________
84 changes: 84 additions & 0 deletions kernels/logcumsumexp/logcumsumexp.cpp
@@ -0,0 +1,84 @@
/*************************************************************************
* Copyright (C) [2019-2022] by Cambricon, Inc.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "logcumsumexp.h"
#include <string>
#include "core/context.h"
#include "core/gen_case.h"
#include "core/logging.h"
#include "core/runtime/device.h"
#include "core/tensor.h"
#include "core/type.h"

mluOpStatus_t MLUOP_WIN_API
mluOpLogcumsumexp(mluOpHandle_t handle,
                  const mluOpTensorDescriptor_t input_desc,
                  const void *input,
                  const mluOpTensorDescriptor_t result_desc,
                  void *result,
                  const int32_t dim) {
  const std::string API = "[mluOpLogcumsumexp]";

  cnrtFunctionType_t k_type;
  cnrtDim3_t k_dim;

  PARAM_CHECK(API, handle != NULL);
  PARAM_CHECK(API, input_desc != NULL);
  PARAM_CHECK(API, input != NULL);
  PARAM_CHECK(API, result_desc != NULL);
  PARAM_CHECK(API, result != NULL);
  PARAM_CHECK(API, input_desc->layout == MLUOP_LAYOUT_ARRAY);
  PARAM_CHECK(API, result_desc->layout == MLUOP_LAYOUT_ARRAY);
  // Data-type checks per design-doc section 3.8.
  PARAM_CHECK(API, input_desc->dtype == MLUOP_DTYPE_HALF ||
                       input_desc->dtype == MLUOP_DTYPE_FLOAT);
  PARAM_CHECK(API, input_desc->dtype == result_desc->dtype);

  if (dim < 0) {
    LOG(ERROR) << API << " dim cannot be negative. Received dim=[" << dim
               << "]";
    return MLUOP_STATUS_BAD_PARAM;
  }
  if (dim >= input_desc->dim) {
    LOG(ERROR) << API << " dim exceeds the dimension of the input tensor."
               << " Received dim=[" << dim << "]";
    return MLUOP_STATUS_BAD_PARAM;
  }

  int32_t axis_size = input_desc->dims[dim];
  int32_t higher_size = 1;
  int32_t lower_size = 1;

  // Product of the dimensions before dim.
  for (int i = 0; i < dim; i++) {
    lower_size *= input_desc->dims[i];
  }

  // Product of the dimensions after dim.
  for (int i = dim + 1; i < input_desc->dim; i++) {
    higher_size *= input_desc->dims[i];
  }

  // Dispatch by the four cases described in the design doc.
  if (lower_size == 1 && higher_size == 1) {
    // Case 1: the input is a vector.
    k_type = CNRT_FUNC_TYPE_UNION8;
    k_dim = {32, 1, 1};
  } else if (higher_size == 1) {
    // Case 2: dim is the highest dimension.
    k_type = CNRT_FUNC_TYPE_UNION8;
    k_dim = {32, 1, 1};
  } else if (lower_size == 1) {
    // Case 3: dim is the lowest dimension.
    k_type = CNRT_FUNC_TYPE_UNION1;
    k_dim = {4, 1, 1};
  } else {
    // Case 4: batched case 3.
    k_type = CNRT_FUNC_TYPE_UNION8;
    k_dim = {32, 1, 1};
  }
  CHECK_RETURN(API, KernelLogcumsumexp(
                        k_dim, k_type, handle->queue, input_desc->dtype,
                        input, result, axis_size, higher_size, lower_size));
  GEN_CASE_END();
  return MLUOP_STATUS_SUCCESS;
}
41 changes: 41 additions & 0 deletions kernels/logcumsumexp/logcumsumexp.h
@@ -0,0 +1,41 @@
/*******************************************************************************
* Copyright (C) [2023] by Cambricon, Inc.
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be included
* in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*******************************************************************************/
#ifndef KERNELS_LOGCUMSUMEXP_LOGCUMSUMEXP_H
#define KERNELS_LOGCUMSUMEXP_LOGCUMSUMEXP_H

#include "mlu_op.h"
#include "kernels/debug.h"
#include "kernels/kernel.h"

mluOpStatus_t MLUOP_WIN_API
KernelLogcumsumexp(const cnrtDim3_t k_dim,
const cnrtFunctionType_t k_type,
const cnrtQueue_t queue,
mluOpDataType_t data_type,
const void *input,
void *result,
const int32_t axis_size,
const int32_t higher_size,
const int32_t lower_size);

#endif // KERNELS_LOGCUMSUMEXP_LOGCUMSUMEXP_H