This repository implements a Large Language Model (LLM) from scratch using Python. The goal is to understand the inner workings of LLMs by implementing the core components and techniques used in natural language processing (NLP).
A Bigram Language Model is a simple probabilistic model that predicts the next word in a sequence based on the previous word. It assumes that the probability of a word depends only on the word that came immediately before it.
A bigram refers to a pair of consecutive words in a sentence or text. For example:
📌 Sentence: "I love machine learning"
📌 Bigrams: ("I", "love")
, ("love", "machine")
, ("machine", "learning")
A bigram language model estimates P(w₂ | w₁), the probability of a word w₂ given the previous word w₁.
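As a quick illustration, here is a minimal Python sketch (variable names are illustrative) that extracts the bigrams of the example sentence above using simple whitespace tokenization:

```python
# Extract word-level bigrams from a sentence using whitespace tokenization.
sentence = "I love machine learning"
tokens = sentence.split()

# Pair each token with the token that immediately follows it.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)  # [('I', 'love'), ('love', 'machine'), ('machine', 'learning')]
```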
A bigram model learns the conditional probabilities of word sequences from a corpus:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$

where:

- $C(w_{n-1}, w_n)$ = count of the bigram $(w_{n-1}, w_n)$ in the corpus
- $C(w_{n-1})$ = count of the word $w_{n-1}$ in the corpus
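This is a maximum likelihood estimate from raw counts. Below is a minimal sketch of how it could be computed, assuming a tiny whitespace-tokenized toy corpus (the corpus and the `bigram_probability` helper are illustrative, not part of the repository):

```python
from collections import Counter

# Illustrative toy corpus, tokenized by whitespace.
corpus = "i love machine learning and i love nlp".split()

# C(w_{n-1}, w_n): count of each adjacent word pair.
bigram_counts = Counter(zip(corpus, corpus[1:]))

# C(w_{n-1}): count of each word in the "previous word" position.
prev_counts = Counter(corpus[:-1])

def bigram_probability(prev_word, word):
    """Maximum likelihood estimate of P(word | prev_word)."""
    if prev_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / prev_counts[prev_word]

print(bigram_probability("i", "love"))  # 1.0 in this toy corpus
```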
For the word "hello", we compute bigram probabilities by counting occurrences in a corpus:
| Bigram | Count | Total Count of Previous Character | Probability |
|---|---|---|---|
| h → e | 1 | 1 (h appears once) | 1/1 = 1.000 |
| e → l | 1 | 1 (e appears once) | 1/1 = 1.000 |
| l → l | 1 | 2 (l appears twice) | 1/2 = 0.500 |
| l → o | 1 | 2 (l appears twice) | 1/2 = 0.500 |
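The table above can be reproduced with a short character-level sketch (illustrative code, not tied to the repository's modules):

```python
from collections import Counter

word = "hello"

# Count character bigrams and occurrences of each character in the "previous" position.
bigram_counts = Counter(zip(word, word[1:]))
prev_counts = Counter(word[:-1])

# Print P(next | prev) for every bigram, matching the table above.
for (prev, nxt), count in bigram_counts.items():
    print(f"{prev} -> {nxt}: {count}/{prev_counts[prev]} = {count / prev_counts[prev]:.3f}")
```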
Using the learned bigram probabilities, we generate "hello" step by step:

- Start with "h"
- P(e | h) = 1.000 → "he"
- P(l | e) = 1.000 → "hel"
- P(l | l) = 0.500 → "hell"
- P(o | l) = 0.500 → "hello"