
Commit 425aaa1

Updated proposal pdf and blog image
1 parent acac7f9 commit 425aaa1

6 files changed: +7 -3 lines changed


.github/actions/spelling/allow/names.txt

Lines changed: 2 additions & 0 deletions

@@ -72,6 +72,7 @@ Svrin
 Tadel
 Taras
 Thessaloniki
+Timmaraju
 Universitat
 Unveristy
 Uppili
@@ -192,6 +193,7 @@ tapaswenipathak
 tfransham
 thakkar
 tharun
+timmaraju
 tlattner
 vaibhav
 vassil

.github/actions/spelling/allow/terms.txt

Lines changed: 2 additions & 0 deletions

@@ -5,6 +5,7 @@ CMSSW
 Cppyy
 Debian
 GPGPU
+GPT
 GSo
 GSoC
 HSF
@@ -28,6 +29,7 @@ cytokine
 cytokines
 gitlab
 gsoc
+llm
 linkedin
 microenvironments
 pythonized

_data/contributors.yml

Lines changed: 1 addition & 1 deletion

@@ -322,7 +322,7 @@
 - title: "Enhancing LLM Training Efficiency with Clad for Automatic Differentiation"
   status: Ongoing
   description: |
-    Training Large Language Models (LLMs) is computationally expensive, often bottlenecked by the performance limitations of Python-based frameworks. This project addresses this challenge by enhancing LLM training efficiency within a C++ environment through the integration of Clad, a Clang/LLVM compiler plugin for automatic differentiation (AD). We will deveop a custom C++ tensor library specifically designed for optimal interaction with Clad. The core objective is to replace traditional runtime or manual gradient computations with Clad's efficient compile-time differentiation for key LLM operations within a GPT-2 training pipeline. This involves investigating effective strategies to bridge Clad's static analysis with dynamic neural network computations, benchmarking the resulting performance gains in speed and memory usage against a non-Clad baseline, and leveraging OpenMP for further parallelization.
+    Training Large Language Models is computationally expensive, often limited by the performance of Python-based frameworks. This project addresses this challenge by enhancing LLM training efficiency within a C++ environment through the integration of Clad, a Clang/LLVM compiler plugin for automatic differentiation (AD). We will develop a custom C++ tensor library specifically designed for optimal interaction with Clad. The core objective is to replace traditional runtime or manual gradient computations with Clad's efficient compile-time differentiation for key LLM operations within a GPT-2 training pipeline. This involves investigating effective strategies to bridge Clad's static analysis with dynamic neural network computations, benchmarking the resulting performance gains in speed and memory usage against a non-Clad baseline, and leveraging OpenMP for further parallelization.
   proposal: /assets/docs/Rohan_Timmaraju_Proposal_2025.pdf
   mentors: Vassil Vassilev, David Lange, Jonas Rembser, Christina Koutsou
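The description above refers to replacing manual or runtime gradient computation with Clad's compile-time differentiation. As a minimal illustration of what that looks like, the sketch below differentiates a toy squared-error loss with `clad::gradient`; the `loss` function and its parameters are hypothetical and not part of the project's code, and the build command assumes a locally built Clad plugin.

```cpp
// Minimal sketch of Clad's reverse-mode AD on a toy loss function.
// Hypothetical example, not the project's tensor library. Assumes Clad is
// available, e.g.: clang++ -fplugin=/path/to/clad.so -I<clad-include> ex.cpp
#include "clad/Differentiator/Differentiator.h"
#include <iostream>

// Toy squared-error "model": prediction w*x + b compared against target y.
double loss(double w, double b, double x, double y) {
  double pred = w * x + b;
  double diff = pred - y;
  return diff * diff;
}

int main() {
  // Clad generates the gradient of `loss` w.r.t. w and b at compile time.
  auto dloss = clad::gradient(loss, "w, b");

  double dw = 0.0, db = 0.0;
  // Evaluate the generated gradient at w=1, b=0, x=2, y=3.
  dloss.execute(/*w=*/1.0, /*b=*/0.0, /*x=*/2.0, /*y=*/3.0, &dw, &db);

  std::cout << "dL/dw = " << dw << ", dL/db = " << db << "\n"; // -4 and -2
  return 0;
}
```

The same pattern is what the project aims to scale up: instead of hand-writing the backward pass for each tensor operation, the gradient code is produced by the compiler plugin from the forward definition.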

_posts/2025-05-21-enhancing-llm-training.md

Lines changed: 2 additions & 2 deletions

@@ -5,7 +5,7 @@ excerpt: "This GSoC project leverages Clad to optimize LLM training in C++, aimi
 sitemap: true
 author: Rohan Timmaraju
 permalink: blogs/gsoc25_rohan_introduction_blog/
-banner_image: /images/blog/gsoc-banner.png
+banner_image: /images/blog/LLM_project_banner.jpg
 date: 2025-05-21
 tags: gsoc c++ clang clad llm
 ---
@@ -29,7 +29,7 @@ This project proposes to tackle this challenge by integrating Clad, an Automatic
 To facilitate this integration, I am developing a custom C++ tensor library to be used in neural network training. Inspired by the powerful approaches of libraries such as [llm.c](https://github.com/karpathy/llm.c) and [pytorch](https://docs.pytorch.org/cppdocs/), this library is being designed from the ground up with Clad compatibility in mind. The core idea is to replace manual or internally managed gradient computations with Clad's reverse-mode AD (as in `clad::gradient`) for key LLM operations like matrix multiplications, activation functions, normalization layers, and the final loss function.

 ### Implementation Plan
-1. **Foundation & Baseline:** We'll start by implementing a complete GPT-2 training loop in C++ *without* Clad. This will serve as our performance baseline. GPT-2 is chosen here as a relatively simple open-source LLM architecture capable of being trained on local devices. This could be extended to other architectures like Llama or Mistral.
+1. **Foundation & Baseline:** The implementation will begin with a complete GPT-2 training loop in C++ *without* Clad. This will serve as our performance baseline. GPT-2 is chosen here as a relatively simple open-source LLM architecture capable of being trained on local devices. This could be extended to other architectures like Llama or Mistral.
 2. **Core Clad Integration Strategy:** We will investigate and evaluate different strategies for applying Clad to tensor network gradient calculations, while also identifying areas where Clad itself could be enhanced for deep learning workloads.
 3. **Expanding Integration:** Once a promising strategy is identified and validated on simpler operations, we'll systematically integrate Clad into more complex components of the GPT-2 architecture.
 4. **Benchmarking & Optimization:** Benchmarking against our baseline will be crucial to quantify the performance gains (speed, memory). We'll also use profiling tools to identify bottlenecks and optimize the tensor library with Clad. OpenMP may be employed for parallelization to further boost performance.
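Item 4 of the plan mentions OpenMP for parallelization, and the post lists matrix multiplication among the key operations of the tensor library. As a rough sketch only (assuming a simple row-major layout; this is not the project's actual kernel), such an operation could be parallelized like this:

```cpp
// Hypothetical sketch of an OpenMP-parallelized matmul kernel, assuming
// row-major storage. Illustrative only; compile with -fopenmp to enable
// threading (the pragma is ignored otherwise and the code stays correct).
#include <vector>

// C (M x N) = A (M x K) * B (K x N)
void matmul(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& C, int M, int K, int N) {
  #pragma omp parallel for  // distribute output rows across threads
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = acc;
    }
  }
}
```

A baseline of this kind would then be compared, in the benchmarking step, against the same forward kernels with Clad-generated gradients, measuring speed and memory as the post describes.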
-648 KB · Binary file not shown.

images/blog/LLM_project_banner.jpg

354 KB
