diff --git a/Transformers/GPT/README.md b/Transformers/GPT/README.md
index 830a792..1184ac1 100644
--- a/Transformers/GPT/README.md
+++ b/Transformers/GPT/README.md
@@ -6,6 +6,65 @@ GPT is the Transformer based model that is proposed by OpenAI.
 
 OpenAI GPT model was proposed in [Improving Language Understanding by Generative Pre-Training [1]](./papers/gpt.pdf) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It’s a causal (unidirectional) transformer pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
 
+- GPT-1 largely follows the original transformer work.
+- The authors trained a 12-layer decoder-only transformer with masked self-attention heads (768-dimensional states and 12 attention heads).
+  - For the position-wise feed-forward networks, 3072-dimensional inner states were used.
+- Adam with a max learning rate of 2.5e-4 (GPT-3 later uses 6e-4 for a model of this size).
+- LR schedule: increased linearly from zero over the first 2000 updates, then annealed to 0 using a cosine schedule (see the sketch at the end of this section).
+- The model was trained for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
+- Since layernorm is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient.
+- Bytepair encoding (BPE) vocabulary with 40,000 merges.
+- Residual, embedding, and attention dropouts with a rate of 0.1 for regularization.
+- A modified version of L2 regularization (decoupled weight decay) with w = 0.01 on all non-bias or gain weights.
+- The Gaussian Error Linear Unit (GELU) is used as the activation function.
+- Learned position embeddings are used instead of the sinusoidal version proposed in the original work.
+- For fine-tuning: dropout with a rate of 0.1 is added to the classifier; learning rate of 6.25e-5, batch size of 32, and 3 epochs.
+  - A linear learning rate decay schedule with warmup over 0.2% of training is used. λ, the weight on the auxiliary language-modeling objective, was set to 0.5.
+- The GPT-1 model has 12 layers with d_model 768, ~117M parameters (see the parameter-count sketch at the end of this section).
+
 ### Tips for using GPT
 
 - GPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.
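+
+Since GPT uses absolute position embeddings, batches are usually padded on the right, as noted above. A minimal sketch of right-padding, assuming the Hugging Face `transformers` library and the `openai-gpt` checkpoint (the GPT tokenizer has no pad token out of the box, so one is assigned here purely for illustration):
+
+```python
+from transformers import OpenAIGPTTokenizer
+
+tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
+tokenizer.padding_side = "right"           # pad on the right for absolute position embeddings
+tokenizer.pad_token = tokenizer.unk_token  # openai-gpt ships without a dedicated pad token
+
+batch = tokenizer(["a short sentence", "a slightly longer input sentence"],
+                  padding=True, return_tensors="pt")
+print(batch["input_ids"].shape)  # (2, longest_sequence_in_batch)
+```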
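+
+The ~117M figure can be checked with a rough back-of-the-envelope count. A minimal sketch in plain Python; the vocabulary size of 40,478 is taken from the released GPT-1 checkpoint (40,000 BPE merges plus base and special tokens), and biases/layernorm gains are ignored as an approximation:
+
+```python
+def gpt1_param_count(n_layer=12, d_model=768, d_ff=3072, vocab=40_478, n_ctx=512):
+    """Approximate parameter count of a GPT-1-sized decoder-only transformer."""
+    embeddings = vocab * d_model + n_ctx * d_model  # token + learned position embeddings
+    attn = 4 * d_model * d_model                    # Q, K, V and output projections per layer
+    ffn = 2 * d_model * d_ff                        # position-wise feed-forward matrices per layer
+    return embeddings + n_layer * (attn + ffn)
+
+print(f"{gpt1_param_count() / 1e6:.0f}M parameters")  # ~116M, commonly rounded to ~117M
+```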
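+
+The pre-training learning-rate schedule referenced above (linear warmup from zero over the first 2000 updates, then cosine annealing to 0) can be written as a small function. A minimal sketch in plain Python; `total_updates` is a hypothetical parameter and its default value here is illustrative only:
+
+```python
+import math
+
+def gpt1_lr(step, max_lr=2.5e-4, warmup=2000, total_updates=200_000):
+    """Linear warmup to max_lr over `warmup` steps, then cosine anneal to 0."""
+    if step < warmup:
+        return max_lr * step / warmup
+    progress = (step - warmup) / max(1, total_updates - warmup)
+    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))
+
+for step in (0, 1000, 2000, 100_000, 200_000):
+    print(step, f"{gpt1_lr(step):.2e}")
+```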