backProp

反傳遞演算法

在前文從微分到梯度下降法當中，我們說明了如何計算《偏微分與梯度》然後寫了一個《梯度下降法》，用來尋找函數的最低點。

如果使用數值的方式計算梯度，那麼就要計算 $\frac{f(x+h, y) - f(x, y)}{h}$, $\frac{f(x, y+h) - f(x, y)}{h}$ 這樣的方式，重複呼叫 f 幾次後達成的。

假如函數 f 的參數有 n 個，那麼要算出梯度，就必須重複的呼叫 n 次以上的 f 函數，因為每個變數都要計算偏微分，每次計算偏微分都要多算一次 f 函數。

換句話說，我們必須計算出每個 $x_i$ 的偏微分函數值 $\frac{\partial f}{\partial x_i}$ ，所以要計算下列 n 個 f 函數值

要算偏微分 $\frac{\partial f}{\partial x_1}$ ，就要先算出 $f(x_1+h, x_2,...., x_n)$
要算偏微分 $\frac{\partial f}{\partial x_2}$ ，就要先算出 $f(x_1, x_2+h, ...., x_n)$
...
要算偏微分 $\frac{\partial f}{\partial x_n}$ ，就要先算出 $f(x_1, x_2, ...., x_n+h)$

這樣當參數數量 n 很大的時候，梯度的計算就會變得很慢，因此我們必須想辦法加快《梯度的計算速度》。

這就是為何需要《反傳遞算法》的原因了！

從函數的觀點看反傳遞

反傳遞是根據鏈鎖規則的公式，所謂的鏈鎖規則，就是：

a = p(...,x,y,....)
f = q(...,a,...)

那麼就有下列鏈鎖法則

$$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x} = q'(...,a,...) p'(...,x,y,...) $$

如果該網路有更多層，例如像下列程式

a = p(...,x,y,....)
b = q(...,a, ...)
f = r(...,b,....)

這樣的話，根據鏈鎖規則，可以得到

$$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial x} = r'(...,b,...) q'(...,a,...) p'(...,x,y,...) $$

在計算梯度的時候，我們可以進行一次正向傳遞計算，算出 a,b,f 等等變數值。

然後再進行一次反向傳遞計算，算出各變數的梯度值 ( $\frac{\partial f}{\partial b}$ ， $\frac{\partial f}{\partial a}$ ， $\frac{\partial f}{\partial x}$ ， $\frac{\partial f}{\partial y}$ )

計算圖

一個函數，我們通常可以分解成一連串的《加減乘除》運算，然後繪製成計算圖。

舉例而言， f(x,y,z) = (x + y) * z ，我們可以將其繪製成下列計算圖

這時候如果我們用函數 (add, mul) 的方式改寫上述 f(x,y,z) 函數，就會變成

def f(x,y,z):
    q = add(x,y)
    return mul(q,z)

對於乘法，我們知道進行偏微分時，其梯度的計算方式為：

$\frac{\partial(f)}{\partial(z)}=q$

$\frac{\partial(f)}{\partial(q)}=z$

然後對於加法，我們知道進行偏微分時，其梯度的計算方式為：

$\frac{\partial(q)}{\partial(x)}=1$

$\frac{\partial(q)}{\partial(y)}=1$

於是透過鏈鎖規則，我們可以知道

$\frac{\partial(f)}{\partial(x)}=\frac{\partial(f)}{\partial(q)} \frac{\partial(q)}{\partial(x)} = z * 1$

這種透過鏈鎖規則，一層一層反傳遞，把函數 f 對各個參數 (x,y,z...) 的偏微分計算出來的方式，就稱為反傳遞算法。

在上述範例中，進行反傳遞算法時，我們必須先根據輸入，例如 (x=-2,y=5,z=-4) ，計算一次正傳遞，把計算圖中每個節點的輸出都算出來 (q=3,f=-12)。

然後在反傳遞時，我們就可以計算出各個變數的偏微分值，像是

$\frac{\partial(f)}{\partial(q)} = z = -4$

$\frac{\partial(f)}{\partial(z)} = q = 3$

接著繼續進行反傳遞，於是得知

$\frac{\partial(f)}{\partial(x)}=\frac{\partial(f)}{\partial(q)} \frac{\partial(q)}{\partial(x)} = z * 1 = -4 * 1 = -4$

$\frac{\partial(f)}{\partial(y)}=\frac{\partial(f)}{\partial(q)} \frac{\partial(q)}{\partial(y)} = z * 1 = -4 * 1 = -4$

把這些偏微分拼起來，就得到了梯度

$$ \nabla f(x,y,z) = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\right] $$

所以上述函數 f(x,y,z) 目前的梯度方向為 (-4,-4,3)。

通常在反傳遞的一開始，我們就設定輸出節點 f 的梯度為 1.0 然後才開始反傳遞，然後才一層一層的反傳遞得到下列結果。

倒數第0層
- f.grad = 1.0
倒數第1層
- z.grad = q.value * f.grad = 3 * 1 = 3
- q.grad = z.value * f.grad = -4 * 1 = -4
倒數第2層:
- x.grad = z.grad * 1 = -4
- y.grad = z.grad * 1 = -4

自製梯度引擎

一旦你理解了上述的反傳遞算法，就能自行設計一個梯度引擎。

舉例而言，我們可以透過宣告一個 Value 類別，然後用下列程式碼，輕輕鬆鬆的就能計算出梯度，因為反傳遞的任務全都交給梯度引擎了

https://github.com/cccbook/py2gpt/blob/master/03-backprop/01-micrograd0/ex0.py

from micrograd.engine import Value

x = Value(-2.0)
y = Value(5.0)
z = Value(-4.0)

q = x+y
f = q*z

print('====== forward ======')
print('x=', x)
print('y=', y)
print('z=', z)
print('q=', q)
print('f=', f)

print('====== backward ======')
f.backward()
print('x=', x)
print('y=', y)
print('z=', z)
print('q=', q)
print('f=', f)

'''
$ python ex0.py
====== forward ======
x= Value(data=-2.0, grad=0)
y= Value(data=5.0, grad=0)
z= Value(data=-4.0, grad=0)
q= Value(data=3.0, grad=0)
f= Value(data=-12.0, grad=0)
====== backward ======
x= Value(data=-2.0, grad=-4.0)
y= Value(data=5.0, grad=-4.0)
z= Value(data=-4.0, grad=3.0)
q= Value(data=3.0, grad=-4.0)
f= Value(data=-12.0, grad=1)
'''

Karpathy 的 micrograd 專案，就是這樣一個梯度引擎，這個專案的 engine.py 模組只有 95 行，但卻實作出了和 Pytorch/Tensorflow 類似的梯度引擎功能，雖然速度慢上許多，但是觀念原理卻非常清楚。

為了讓大家看得更清楚，我們將 micrograd 進一步簡化，只留下 +, * 等兩個運算後，列出原始碼如下：

https://github.com/cccbook/py2gpt/blob/master/03-backprop/01-micrograd0/micrograd/engine.py

class Value: # 含有梯度的值節點
    """ stores a single scalar value and its gradient """

    def __init__(self, data, _children=(), _op=''):
        self.data = data # 目前值
        self.grad = 0 # 梯度預設為 0
        # internal variables used for autograd graph construction
        self._backward = lambda: None # 反傳遞函數
        self._prev = set(_children) # 前面的網路節點
        self._op = _op # the op that produced this node, for graphviz / debugging / etc

    def __add__(self, other): # 加法的正向傳遞
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward(): # 加法的反向傳遞
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other): # 乘法的正向傳遞
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward(): # 乘法的反向傳遞
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def backward(self):
        # topological order all of the children in the graph
        topo = []
        visited = set()
        def build_topo(v): # 建立網路拓譜結構
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1
        for v in reversed(topo): # 反向排列
            v._backward()

    def __repr__(self): # 轉字串 -- https://www.educative.io/edpresso/what-is-the-repr-method-in-python
        return f"Value(data={self.data}, grad={self.grad})"

您可以看到 Value 類別的乘法函數 __mul__ 被重新定義了，除了計算正向傳遞的結果外，也加入了 _backward 這個反向傳遞函數放入輸出 out._backward 當中，

    def __mul__(self, other): # 乘法的正向傳遞
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward(): # 乘法的反向傳遞
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

一旦我們呼叫 f.backward() 時，就會使用下列程式碼，先對變數進行拓譜排序，然後按照反向拓譜順序，進行反傳遞的梯度值計算。

    def backward(self):
        # topological order all of the children in the graph
        topo = []
        visited = set()
        def build_topo(v): # 建立網路拓譜結構
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1
        for v in reversed(topo): # 反向排列
            v._backward()

於是我們就不需要一直去想《微積分的反傳遞公式》這些傷腦筋的問題了，因為這些問題全部都交給梯度引擎去處理就好了。

我們只要負責寫出算式，然後呼叫反傳遞，就能得到梯度了！

當然，真實的深度學習套件，絕對不會只有《加法和乘法》，另外像是次方，還有 relu 這樣的非線性函數，都是必要的，以下是 micrograd 中的 relu 與次方之定義。

https://github.com/cccbook/py2gpt/blob/master/03a-MicroGrad/micrograd/engine.py

    def relu(self):
        out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

        def _backward():
            self.grad += (out.data > 0) * out.grad # gx = 1 if f>0 else 0
        out._backward = _backward

        return out

由於覆蓋掉了 +, += 等運算，於是 sum 函數也可以直接使用，因此在 micrograd 的 nn.py 中就可以定義出下列神經元模組

https://github.com/cccbook/py2gpt/blob/master/03a-MicroGrad/micrograd/nn.py

class Neuron(Module):

    def __init__(self, nin, nonlin=True):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0)
        self.nonlin = nonlin

    def __call__(self, x):
        act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
        return act.relu() if self.nonlin else act

    def parameters(self):
        return self.w + [self.b]

    def __repr__(self):
        return f"{'ReLU' if self.nonlin else 'Linear'}Neuron({len(self.w)})"

接著就請大家仔細看看 micrograd 專案吧！

如果我們引入 numpy 套件，將多維陣列直接套用到 micrograd 的梯度引擎中，甚至可以做出《層》(Layer) 這樣的觀念，而現代的深度學習套件，幾乎都是《一層又一層》的堆疊方式，這些層的輸入與輸出，通常都是 tensor (張量: 就是 n 維陣列)，因此 Google 的深度學習套件才會命名為 Tensorflow 。

為了支援《層》的觀念與張量，我們複製了 newcodevelop 改過的 micrograd 版本，並將 micrograd 改名為 macrograd ，這樣就能夠用簡單的層次堆疊，建構出複雜的深度神經網路了。

舉例而言，若要使用 macrograd 來建構一個多層感知器 (MLP)，學習如何辨認 MNIST 資料庫裡的手寫數字，可以用下列程式：

https://github.com/cccbook/py2gpt/blob/master/03b-MacroGrad/macrograd.py

from macrograd import Tensor

from keras.datasets import mnist
import keras
import numpy as np

(x_train, y_train), (x_test, y_test) = mnist.load_data()
train_images = np.asarray(x_train, dtype=np.float32) / 255.0
test_images = np.asarray(x_test, dtype=np.float32) / 255.0
train_images = train_images.reshape(60000, 784)
test_images = test_images.reshape(10000, 784)
y_train = keras.utils.to_categorical(y_train)

def calculate_loss(X, Y, W):
    return -(1/X.shape[0])*np.sum(np.sum(Y*np.log(np.exp(np.matmul(X, W)) / np.sum(np.exp(np.matmul(X, W)), axis=1)[:, None]), axis=1))

batch_size = 32
steps = 20000
# new initialized weights for gradient descent
Wb = Tensor(np.random.randn(784, 10))
for step in range(steps):
    ri = np.random.permutation(train_images.shape[0])[:batch_size]
    Xb, yb = Tensor(train_images[ri]), Tensor(y_train[ri])
    y_predW = Xb.matmul(Wb)
    probs = y_predW.softmax()
    log_probs = probs.log()

    zb = yb*log_probs

    outb = zb.reduce_sum(axis=1)
    finb = -outb.reduce_sum()  # cross entropy loss
    finb.backward()
    if step % 1000 == 0:
        loss = calculate_loss(train_images, y_train, Wb.data)
        print(f'loss in step {step} is {loss}')
    Wb.data = Wb.data - 0.01*Wb.grad # update weights, 相當於 optimizer.step()
    Wb.grad = 0

loss = calculate_loss(train_images, y_train, Wb.data)
print(f'loss in final step {step+1} is {loss}')

'''
$ python mnist.py
loss in step 0 is 17.92461998035962
loss in step 1000 is 0.6774682437270458
loss in step 2000 is 0.5477999727070898
loss in step 3000 is 0.4989018310540714
loss in step 4000 is 0.44975912129728784
loss in step 5000 is 0.4203744240123638
loss in step 6000 is 0.411572470806632
loss in step 7000 is 0.39024896870370307
loss in step 8000 is 0.3769509563066923
loss in step 9000 is 0.3768737910199006
loss in step 10000 is 0.3587365915719602
loss in step 11000 is 0.34737773487143964
loss in step 12000 is 0.37459058392841504
loss in step 13000 is 0.3411812984086028
loss in step 14000 is 0.32203538914621826
loss in step 15000 is 0.32873441998466774
loss in step 16000 is 0.32793335403592927
loss in step 17000 is 0.31231447428300846
loss in step 18000 is 0.3133160779430901
loss in step 19000 is 0.3550552459097412
loss in final step 20000 is 0.31858205117532523
'''

結語

在前文從微分到梯度下降法當中，我們談到《數值型梯度計算》的速度，在參數很多時會很慢，問題是《深度學習神經網路》的參數量通常很大 (從百萬到千億等級)，因此不能用數值型的梯度算法。

反傳遞算法可以快速地計算出梯度，其反向傳遞的速度和正向傳遞幾乎一樣快，這讓大型神經網路的梯度能夠用目前的電腦進行計算。

有了快速的梯度計算，剩下的問題就是，甚麼樣的神經網路模型，才能有效的學會解答《自然語言和影像處理》等領域的問題，在深度學習技術被發展出來之後，人工智慧的研究者發現，卷積神經網路 (CNN) 可以有效辨識影像，而循環神經網路 (RNN) 則在自然語言的問題，像是《摘要、問答、翻譯》上表現不錯。

2018 年 Google 的一篇革命性論文 Attention is all you need 出現後，大家發現使用 Attention/Transformer 的模型比使用 RNN 的模型更快又表現更好，於是 Attention/Transformer 就成了新一代 seq2seq 技術的主流。

2022 年底 ChatGPT 出現之後，讓大家開始注意到人工智慧技術，竟然已經進步到可以和人類對答如流的程度。

而 ChatGPT 背後的 GPT 技術，採用的正是 Transformer 中的 Decoder 模組，一樣是用 Attention 模型，這是我們接下來要學習的主題！

陳鍾誠於金門大學資訊工程系 -- 本書衍生自維基百科與 Karpathy 的 micrograd 與 minGPT ，採用 CC: BY-SA 授權

backProp

反傳遞演算法

從函數的觀點看反傳遞

計算圖

自製梯度引擎

結語

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!