Modern NLP with Hugging Face

  • This is a four-week course that helps you understand Hugging Face's rich ecosystem and develop LLMs.

Week 1: Datasets

  • Slides
  • Notebook
  • Loading datasets in different formats
  • Understand the structure of the loaded dataset
  • Access and manipulate samples in the dataset (a minimal sketch follows this list)
    • Concatenate
    • Interleave
    • Map
    • Filter
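
A minimal sketch tying these steps together, assuming the `datasets` library and a hypothetical train.csv file with a text column (the course notebook may differ):

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Load a dataset from a local CSV file; other formats (json, text, parquet)
# work the same way. "train.csv" and its "text" column are hypothetical.
ds = load_dataset("csv", data_files="train.csv", split="train")

# Understand the structure of the loaded dataset
print(ds)           # Dataset({features: [...], num_rows: ...})
print(ds.features)  # column name -> feature type mapping
print(ds[0])        # access a single sample as a dict

# Concatenate datasets that share a schema
combined = concatenate_datasets([ds, ds])

# Interleave samples from several datasets (useful for mixing corpora)
mixed = interleave_datasets([ds, ds])

# Map: transform every sample; batched=True processes many rows per call
ds = ds.map(lambda batch: {"n_chars": [len(t) for t in batch["text"]]},
            batched=True)

# Filter: keep only samples matching a predicate
ds = ds.filter(lambda sample: sample["n_chars"] > 0)
```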

Week 2: Tokenizers

  • Slides
  • Notebook
  • Set up a tokenization pipeline (a sketch covering these steps follows the list)
  • Train the tokenizer
  • Encode the input samples (single or batch)
  • Test the implementation
  • Save and load the tokenizer
  • Decode the token_ids
  • Wrap the tokenizer with the PreTrainedTokenizer class
  • Save the pre-trained tokenizer
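
The sketch below walks through these steps with the `tokenizers` and `transformers` libraries; the byte-level BPE choice, the tiny stand-in corpus, and the file names are assumptions, not the course's exact configuration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline: a BPE model with byte-level pre-tokenization
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on an iterator over raw text (stand-in corpus)
corpus = ["hello world", "hugging face tokenizers"]
trainer = trainers.BpeTrainer(vocab_size=1000,
                              special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode the input samples (single or batch) and test the implementation
enc = tokenizer.encode("hello face")
batch = tokenizer.encode_batch(["hello world", "hugging face"])
print(enc.tokens, enc.ids)

# Decode the token_ids back to text
print(tokenizer.decode(enc.ids))

# Save and load the tokenizer
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap the trained tokenizer (here with the fast variant of the
# pre-trained tokenizer class) and save the pre-trained tokenizer
wrapped = PreTrainedTokenizerFast(tokenizer_object=tokenizer,
                                  unk_token="[UNK]", pad_token="[PAD]")
wrapped.save_pretrained("my-tokenizer")
```

Once saved this way, the tokenizer can be reloaded with AutoTokenizer.from_pretrained("my-tokenizer") like any other Hugging Face tokenizer.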

Week 3: Pre-training GPT-2

  • Slides
  • Notebook
  • Download the model checkpoints: Here
  • Set up the training pipeline (a hedged sketch follows this list)
    • Dataset: BookCorpus
    • Number of tokens: 1.08 billion
    • Tokenizer: GPT-2 tokenizer
    • Model: GPT-2 with a CLM head
    • Optimizer: AdamW
    • Parallelism: DDP (with L4 GPUs)
  • Train the model on
    • A100 80 GB, single GPU
    • L4 48 GB, single node, multiple GPUs
    • V100 32 GB, single GPU
  • Training report at wandb
  • Text Generation
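
A hedged sketch of this pipeline using the Trainer API, assuming the `datasets` and `transformers` libraries; the hyperparameters are illustrative, not the course's exact values:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer,
                          TrainingArguments)

# GPT-2 tokenizer; GPT-2 has no pad token, so reuse EOS for padding
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# BookCorpus, as in the course outline
raw = load_dataset("bookcorpus", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# GPT-2 with a CLM head, randomly initialized for pre-training
model = GPT2LMHeadModel(GPT2Config())

# mlm=False yields causal-language-modeling labels
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-bookcorpus",
    per_device_train_batch_size=8,   # illustrative value
    learning_rate=6e-4,              # Trainer uses AdamW by default
    num_train_epochs=1,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```

For DDP on a single node with multiple GPUs, the same script can be launched with, e.g., torchrun --nproc_per_node=4 train.py (train.py being a hypothetical name for this script); Trainer picks up the distributed environment automatically.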

Week 4: Fine-tuning Llama 3.2 1B

Optional Contents

  • Evaluate
    • Slides
    • CLI-Script
    • We can use a CPU to evaluate the performance of language models on various benchmarks
    • Let's evaluate the performance of the GPT-2 model (of course, you can use any model from HF) on the MMLU benchmark
    • Execute python eval_mmlu_cli.py --model gpt2 --num_samples 25 in the terminal.
    • This will print the score to the console and also generate a plot (figure: gpt2-evaluate)
    • Can you see the problem with averaging the scores? (A toy illustration follows this list.)
    • Evaluate OpenAI models on MMLU: Script
    • Important Note: The performance of generative models is sensitive to subtle implementation details such as the format of prompts. Therefore, use the LM Evaluation Harness for comparing the performance of models on standard academic benchmarks.
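
On the averaging question: MMLU subjects differ widely in size, so the unweighted mean of per-subject accuracies (macro average) can diverge from the accuracy over all questions pooled together (micro average). A toy illustration with made-up numbers:

```python
# subject: (correct answers, total questions) -- made-up numbers
subjects = {
    "abstract_algebra": (5, 10),      # 50% on a tiny subject
    "professional_law": (300, 1000),  # 30% on a large subject
}

# Macro average: every subject counts equally, however small
macro = sum(c / n for c, n in subjects.values()) / len(subjects)

# Micro average: every question counts equally
micro = sum(c for c, _ in subjects.values()) / sum(n for _, n in subjects.values())

print(f"macro average: {macro:.3f}")  # 0.400 -- small subjects dominate
print(f"micro average: {micro:.3f}")  # 0.302 -- weighted by subject size
```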

About

Notebooks for the DLP course by Prof. Mitesh Khapra and me, offered to IITM BS students.
