Modern NLP with Hugging Face

  • This is a four-week course that helps you understand Hugging Face's rich ecosystem and develop LLMs.

Week 1: Datasets

  • Slides
  • Notebook
  • Loading datasets in different formats
  • Understand the structure of the loaded dataset
  • Access and manipulate samples in the dataset (a minimal sketch follows this list)
    • Concatenate
    • Interleave
    • Map
    • Filter
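
A minimal sketch tying these steps together, assuming the `datasets` library and a hypothetical train.csv file with a text column (the course notebook may differ):

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Load a dataset from a local CSV file; other formats (json, text, parquet)
# work the same way. "train.csv" and its "text" column are hypothetical.
ds = load_dataset("csv", data_files="train.csv", split="train")

# Understand the structure of the loaded dataset
print(ds)           # Dataset({features: [...], num_rows: ...})
print(ds.features)  # column name -> feature type mapping
print(ds[0])        # access a single sample as a dict

# Concatenate datasets that share a schema
combined = concatenate_datasets([ds, ds])

# Interleave samples from several datasets (useful for mixing corpora)
mixed = interleave_datasets([ds, ds])

# Map: transform every sample; batched=True processes many rows per call
ds = ds.map(lambda batch: {"n_chars": [len(t) for t in batch["text"]]},
            batched=True)

# Filter: keep only samples matching a predicate
ds = ds.filter(lambda sample: sample["n_chars"] > 0)
```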

Week 2: Tokenizers

  • Slides
  • Notebook
  • Set up a tokenization pipeline (a sketch covering these steps follows the list)
  • Train the tokenizer
  • Encode the input samples (single or batch)
  • Test the implementation
  • Save and load the tokenizer
  • Decode the token_ids
  • Wrap the tokenizer with the PreTrainedTokenizer class
  • Save the pre-trained tokenizer
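
The sketch below walks through these steps with the `tokenizers` and `transformers` libraries; the byte-level BPE choice, the tiny stand-in corpus, and the file names are assumptions, not the course's exact configuration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline: a BPE model with byte-level pre-tokenization
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on an iterator over raw text (stand-in corpus)
corpus = ["hello world", "hugging face tokenizers"]
trainer = trainers.BpeTrainer(vocab_size=1000,
                              special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode the input samples (single or batch) and test the implementation
enc = tokenizer.encode("hello face")
batch = tokenizer.encode_batch(["hello world", "hugging face"])
print(enc.tokens, enc.ids)

# Decode the token_ids back to text
print(tokenizer.decode(enc.ids))

# Save and load the tokenizer
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap the trained tokenizer (here with the fast variant of the
# pre-trained tokenizer class) and save the pre-trained tokenizer
wrapped = PreTrainedTokenizerFast(tokenizer_object=tokenizer,
                                  unk_token="[UNK]", pad_token="[PAD]")
wrapped.save_pretrained("my-tokenizer")
```

Once saved this way, the tokenizer can be reloaded with AutoTokenizer.from_pretrained("my-tokenizer") like any other Hugging Face tokenizer.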

Week 3: Pre-training GPT-2

  • Slides
  • Notebook
  • Download the model checkpoints: Here
  • Set up the training pipeline (a hedged sketch follows this list)
    • Dataset: BookCorpus
    • Number of tokens: 1.08 billion
    • Tokenizer: GPT-2 tokenizer
    • Model: GPT-2 with a CLM head
    • Optimizer: AdamW
    • Parallelism: DDP (with L4 GPUs)
  • Train the model on
    • A100 80 GB, single GPU
    • L4 48 GB, single node, multiple GPUs
    • V100 32 GB, single GPU
  • Training report at wandb
  • Text Generation
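
A hedged sketch of this pipeline using the Trainer API, assuming the `datasets` and `transformers` libraries; the hyperparameters are illustrative, not the course's exact values:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer,
                          TrainingArguments)

# GPT-2 tokenizer; GPT-2 has no pad token, so reuse EOS for padding
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# BookCorpus, as in the course outline
raw = load_dataset("bookcorpus", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# GPT-2 with a CLM head, randomly initialized for pre-training
model = GPT2LMHeadModel(GPT2Config())

# mlm=False yields causal-language-modeling labels
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-bookcorpus",
    per_device_train_batch_size=8,   # illustrative value
    learning_rate=6e-4,              # Trainer uses AdamW by default
    num_train_epochs=1,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```

For DDP on a single node with multiple GPUs, the same script can be launched with, e.g., torchrun --nproc_per_node=4 train.py (train.py being a hypothetical name for this script); Trainer picks up the distributed environment automatically.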

Week 4: Fine-tuning Llama 3.2 1B

Optional Contents

  • Evaluate
    • Slides
    • CLI-Script
    • We can use a CPU to evaluate the performance of language models on various benchmarks
    • Let's evaluate the performance of the GPT-2 model (of course, you can use any model from HF) on the MMLU benchmark
    • Execute python eval_mmlu_cli.py --model gpt2 --num_samples 25 in the terminal.
    • This will print the score to the console and also generate a plot (figure: gpt2-evaluate)
    • Can you see the problem with averaging the scores? (A toy illustration follows this list.)
    • Evaluate OpenAI models on MMLU: Script
    • Important Note: The performance of generative models is sensitive to subtle implementation details such as the format of prompts. Therefore, use the LM Evaluation Harness for comparing the performance of models on standard academic benchmarks.
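
On the averaging question: MMLU subjects differ widely in size, so the unweighted mean of per-subject accuracies (macro average) can diverge from the accuracy over all questions pooled together (micro average). A toy illustration with made-up numbers:

```python
# subject: (correct answers, total questions) -- made-up numbers
subjects = {
    "abstract_algebra": (5, 10),      # 50% on a tiny subject
    "professional_law": (300, 1000),  # 30% on a large subject
}

# Macro average: every subject counts equally, however small
macro = sum(c / n for c, n in subjects.values()) / len(subjects)

# Micro average: every question counts equally
micro = sum(c for c, _ in subjects.values()) / sum(n for _, n in subjects.values())

print(f"macro average: {macro:.3f}")  # 0.400 -- small subjects dominate
print(f"micro average: {micro:.3f}")  # 0.302 -- weighted by subject size
```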

About

Notebooks for the DLP course by Prof. Mitesh Khapra and me, offered to IITM BS students.
