From 59b8354d36098933435f49656d9dd0404f2fdd6d Mon Sep 17 00:00:00 2001
From: YeonwooSung
Date: Sat, 31 Dec 2022 17:20:39 +0900
Subject: [PATCH] Add descriptions for GPT-2 and GPT-3

---
 Transformers/GPT/README.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/Transformers/GPT/README.md b/Transformers/GPT/README.md
index 1184ac1..39f18ec 100644
--- a/Transformers/GPT/README.md
+++ b/Transformers/GPT/README.md
@@ -34,6 +34,14 @@ OpenAI GPT-2 model was proposed in [Language Models are Unsupervised Multitask L
 
 Basically, there is not much difference between the architecture of GPT and that of GPT-2. OpenAI simply increased the size of the GPT model to create GPT-2, so that it could be trained on a much bigger dataset (40GB of text). However, GPT-2 clearly outperformed the original GPT. Here, we can see that simply increasing the size of the neural network helps to improve the model's performance.
 
+- LayerNorm was moved to the input of each sub-block, similar to a pre-activation residual network
+- an additional layer normalization was added after the final self-attention block
+- a modified initialization which accounts for the accumulation on the residual path with model depth is used
+- the vocabulary is expanded to 50,257 tokens
+- the context size is increased from 512 to 1024 tokens
+- a larger batch size of 512 is used
+- GPT-2 uses 48 layers and d_model 1600 (vs. the original 12 layers and d_model 768): ~1.542B params (see the sketch below)
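+
+Below is a minimal PyTorch sketch of the changes listed above (pre-norm sub-blocks, the extra final LayerNorm, and residual-path-aware initialization). It is only an illustration of these ideas, not the actual GPT-2 implementation; the sizes are simply the ~1.5B-parameter configuration, and names such as `PreNormBlock` and `GPT2Sketch` are made up for this example.
+
+```python
+import math
+import torch
+import torch.nn as nn
+
+# GPT-2 "extra large" sizes from the list above; smaller values are fine for a quick test.
+N_LAYERS, N_HEADS, D_MODEL = 48, 25, 1600
+VOCAB_SIZE, N_CTX = 50257, 1024
+
+class PreNormBlock(nn.Module):
+    """Transformer block with LayerNorm moved to the *input* of each sub-block."""
+    def __init__(self, d_model, n_heads):
+        super().__init__()
+        self.ln1 = nn.LayerNorm(d_model)
+        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
+        self.ln2 = nn.LayerNorm(d_model)
+        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
+                                 nn.Linear(4 * d_model, d_model))
+
+    def forward(self, x, attn_mask=None):
+        h = self.ln1(x)                                    # pre-norm before attention
+        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
+        x = x + self.mlp(self.ln2(x))                      # pre-norm before the MLP
+        return x
+
+class GPT2Sketch(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)   # 50,257-token vocabulary
+        self.pos_emb = nn.Embedding(N_CTX, D_MODEL)        # 1024-token context window
+        self.blocks = nn.ModuleList([PreNormBlock(D_MODEL, N_HEADS) for _ in range(N_LAYERS)])
+        self.ln_f = nn.LayerNorm(D_MODEL)                  # additional LayerNorm after the final block
+        self.head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)
+        for block in self.blocks:                          # modified init: scale residual projections
+            block.attn.out_proj.weight.data.mul_(1.0 / math.sqrt(2 * N_LAYERS))
+            block.mlp[2].weight.data.mul_(1.0 / math.sqrt(2 * N_LAYERS))
+
+    def forward(self, idx, attn_mask=None):                # idx: (batch, seq) token ids
+        pos = torch.arange(idx.size(1), device=idx.device)
+        x = self.tok_emb(idx) + self.pos_emb(pos)
+        for block in self.blocks:
+            x = block(x, attn_mask=attn_mask)              # pass a causal mask when training
+        return self.head(self.ln_f(x))                     # next-token logits
+```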
+
 ### What is GPT-2
 
 The GPT-2 is basically the next-word-prediction feature of a keyboard app, but one that is much larger and more sophisticated than what your phone has. GPT-2 was trained on a massive 40GB dataset called WebText that the OpenAI researchers crawled from the internet as part of the research effort. To compare in terms of storage size, the keyboard app I use, SwiftKey, takes up 78MB of space. The smallest variant of the trained GPT-2 takes up 500MB of storage to store all of its parameters. The largest GPT-2 variant is 13 times that size, so it could take up more than 6.5GB of storage space.
@@ -200,6 +208,21 @@ The most special thing in the GPT-3 is that the size of the model is extremely h
 
 ![Performance Comparison by size of model](./imgs/performance_comparison_by_size_of_models.png)
 
+- GPT-3: 96 layers, 96 heads, with d_model of 12,288 (175B parameters)
+- GPT-1-like: 12 layers, 12 heads, d_model 768 (125M)
+- uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein
+- uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer
+- *always* has the feed-forward layer four times the size of the bottleneck layer, d_ff = 4 * d_model
+- all models use a context window of n_ctx = 2048 tokens
+- Adam with β1 = 0.9, β2 = 0.95, and eps = 10^-8
+- all models use weight decay of 0.1 to provide a small amount of regularization
+  * (NOTE: GPT-1 used 0.01 I believe, see above)
+- clip the global norm of the gradient at 1.0
+- linear LR warmup over the first 375 million tokens
+  - then cosine decay of the learning rate down to 10% of its value, over 260 billion tokens (see the schedule sketch below)
+- gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size
+- the full 2048-token context window is always used, with a special END OF DOCUMENT token as delimiter
+
 #### Pretrained Language Models could be used for downstream tasks
 
 The main point where GPT-3 actually surprised everyone is that GPT-3 did not use finetuning. Since GPT-3 is so large, the researchers decided not to finetune the model, because finetuning an extremely large model is extremely hard. Instead of finetuning, GPT-3 relies on self-supervised pretraining and in-context learning.
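+
+Below is a small, self-contained sketch of the optimizer and learning-rate recipe listed above (Adam with β1 = 0.9, β2 = 0.95, eps = 10^-8, weight decay 0.1, gradient clipping at 1.0, linear warmup followed by cosine decay to 10% of the peak LR, and the linear batch-size ramp). It is an illustration only: `PEAK_LR`, the ramp length, `model`, `loss_fn`, and the batch format are assumed placeholders, not values taken from this repository.
+
+```python
+import math
+import torch
+
+# Placeholder values: the peak LR and the batch-size ramp length differ per GPT-3 model size.
+PEAK_LR = 6e-4
+WARMUP_TOKENS = 375e6      # linear LR warmup over the first 375M tokens
+DECAY_TOKENS = 260e9       # cosine decay stretched over 260B tokens
+MIN_LR_FRAC = 0.10         # decay down to 10% of the peak LR, then hold
+
+def lr_at(tokens_seen: float) -> float:
+    """Learning rate after `tokens_seen` training tokens: linear warmup, then cosine decay."""
+    if tokens_seen < WARMUP_TOKENS:
+        return PEAK_LR * tokens_seen / WARMUP_TOKENS
+    progress = min((tokens_seen - WARMUP_TOKENS) / DECAY_TOKENS, 1.0)
+    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
+    return PEAK_LR * (MIN_LR_FRAC + (1.0 - MIN_LR_FRAC) * cosine)
+
+def batch_size_at(tokens_seen: float, full_batch_tokens: int, ramp_tokens: float = 4e9) -> int:
+    """Linearly ramp the batch size (in tokens) from 32k up to the full value."""
+    frac = min(tokens_seen / ramp_tokens, 1.0)
+    return int(32_000 + frac * (full_batch_tokens - 32_000))
+
+def training_step(model, optimizer, loss_fn, batch, tokens_seen):
+    """One update: set the scheduled LR, backprop, clip the global grad norm at 1.0, and step."""
+    for group in optimizer.param_groups:
+        group["lr"] = lr_at(tokens_seen)
+    loss = loss_fn(model(batch["input_ids"]), batch["labels"])
+    loss.backward()
+    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+    optimizer.step()
+    optimizer.zero_grad()
+    return loss.item()
+
+# Adam with beta1 = 0.9, beta2 = 0.95, eps = 1e-8, plus decoupled weight decay of 0.1:
+# optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
+#                               betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)
+```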