Commit f999f61

Merge branch 'master' into presentation21_05

2 parents ec06b72 + c442f86

14 files changed (+251 / -0 lines)

.github/actions/spelling/allow/names.txt

Lines changed: 2 additions & 0 deletions

@@ -74,6 +74,7 @@ Svrin
 Tadel
 Taras
 Thessaloniki
+Timmaraju
 Universitat
 Unveristy
 Uppili
@@ -197,6 +198,7 @@ tapaswenipathak
 tfransham
 thakkar
 tharun
+timmaraju
 tlattner
 torre
 vaibhav

.github/actions/spelling/allow/terms.txt

Lines changed: 6 additions & 0 deletions

@@ -4,12 +4,15 @@ CINT
 CMSSW
 Cppyy
 Debian
+EPC
 GPGPU
+GPT
 GSo
 GSoC
 HSF
 JIT'd
 Jacobians
+LLMs
 LLVM
 NVIDIA
 NVMe
@@ -30,12 +33,15 @@ gitlab
 gridlay
 gsoc
 gpu
+jthread
+llm
 llvm
 pushforward
 linkedin
 microenvironments
 pythonized
 ramview
+reoptimize
 samtools
 sitemap
 softsusy

_data/contributors.yml

Lines changed: 52 additions & 0 deletions

@@ -275,6 +275,30 @@
     potentially establishing a new standard for high-performance genomic data analysis.
   proposal: /assets/docs/Aditya_Pandey_GSoC2025.pdf
   mentors: Martin Vassilev, Jonas Rembser, Fons Rademakers, Vassil Vassilev
+
+- name: Petro Mozil
+  photo: PetroMozil.jpeg
+  info: "Google Summer of Code 2025 Contributor"
+
+  education: "Bachelor of Computer Science, Ukrainian Catholic University, Ukraine"
+  github: "https://github.com/pmozil"
+  active: 1
+  linkedin: "https://www.linkedin.com/in/petro-mozil-a94583170/"
+  projects:
+  - title: "Enabling support for STL concurrency primitives in CLAD"
+    status: Ongoing
+    description: |
+      Clad recursively iterates over the syntax tree to check whether a given
+      statement should be differentiated. Each function called from inside a
+      differentiated function must be differentiated as well, and so must any
+      object method. The main issue for Clad is that std::thread is an object,
+      and is thus a type that would be differentiated. However, std::thread
+      itself shouldn't be differentiated; the function run inside it should.
+      Some of the STL's concurrency primitives face the same problem: their
+      methods should not be differentiated, and only the location where they
+      were called should be preserved.
+    proposal: /assets/docs/petro_mozil_promosal_GSoC_2025.pdf
+    mentors: Martin Vassilev, David Lange

 - name: Salvador de la Torre Gonzalez
   photo: salva_de_la_torre_gonzalez.jpg
@@ -310,6 +334,34 @@
   proposal: /assets/docs/de_la_torre_gonzalez_salvador_proposal_gsoc_2025.pdf
   mentors: Vassil Vassilev, Lukas Breitwieser

+- name: Rohan Timmaraju
+  photo: Rohan_Timmaraju.jpg
+  info: "Google Summer of Code 2025 Contributor"
+
+  education: "B.S. Computer Science, Columbia University"
+  github: "https://github.com/Rohan-T144"
+  active: 1
+  linkedin: "https://www.linkedin.com/in/rohan-timmaraju-650ba3221/"
+  projects:
+  - title: "Enhancing LLM Training Efficiency with Clad for Automatic Differentiation"
+    status: Ongoing
+    description: |
+      Training Large Language Models is computationally expensive, often
+      limited by the performance of Python-based frameworks. This project
+      addresses that challenge by enhancing LLM training efficiency within a
+      C++ environment through the integration of Clad, a Clang/LLVM compiler
+      plugin for automatic differentiation (AD). We will develop a custom C++
+      tensor library specifically designed for optimal interaction with Clad.
+      The core objective is to replace traditional runtime or manual gradient
+      computations with Clad's efficient compile-time differentiation for key
+      LLM operations within a GPT-2 training pipeline. This involves
+      investigating effective strategies to bridge Clad's static analysis with
+      dynamic neural network computations, benchmarking the resulting
+      performance gains in speed and memory usage against a non-Clad baseline,
+      and leveraging OpenMP for further parallelization.
+    proposal: /assets/docs/Rohan_Timmaraju_Proposal_2025.pdf
+    mentors: Vassil Vassilev, David Lange, Jonas Rembser, Christina Koutsou

 - name: Abdelrhman Elrawy
   photo: Abdelrhman.jpg
   info: "Google Summer of Code 2025 Contributor"

_pages/team/petro-mozil.md

Lines changed: 10 additions & 0 deletions

---
title: "Compiler Research - Team - Petro Mozil"
layout: gridlay
excerpt: "Compiler Research: Team members"
sitemap: false
permalink: /team/PetroMozil

---

{% include team-profile.html %}

_pages/team/rohan-timmaraju.md

Lines changed: 10 additions & 0 deletions

---
title: "Compiler Research - Team - Rohan Timmaraju"
layout: gridlay
excerpt: "Compiler Research: Team members"
sitemap: false
permalink: /team/RohanTimmaraju

---

{% include team-profile.html %}
Lines changed: 69 additions & 0 deletions

---
title: "Advanced symbol resolution and re-optimization for Clang-Repl"
layout: post
excerpt: "Advanced symbol resolution and re-optimization for Clang-Repl is a Google Summer of Code 2025 project. It aims to improve Clang-Repl and ORC JIT by adding support for automatically loading dynamic libraries when symbols are missing. This removes the need for users to load libraries manually and makes things work more smoothly."
sitemap: false
author: Sahil Patidar
permalink: blogs/gsoc25_sahil_introduction_blog/
banner_image: /images/blog/gsoc_clang_repl.jpeg
date: 2025-05-18
tags: gsoc LLVM clang-repl ORC-JIT auto-loading
---
### Introduction

I am Sahil Patidar, a student participating in Google Summer of Code 2025. I will be working on the project "Advanced symbol resolution and re-optimization for Clang-Repl".

**Mentor**: Vassil Vassilev
### Overview of the Project

[Clang-Repl](https://clang.llvm.org/docs/ClangRepl.html) is a powerful interactive C++ interpreter that leverages LLVM's ORC JIT to support incremental compilation and execution. Currently, users must load dynamic libraries manually when their code references external symbols, because Clang-Repl cannot automatically resolve symbols from dynamic libraries.
To address this limitation, we propose enabling **auto-loading of dynamic libraries for unresolved symbols** within ORC JIT, which is central to Clang-Repl's runtime infrastructure.
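
A rough sketch of today's workflow (library and symbol chosen only for illustration); with auto-loading, the manual `%lib` step would no longer be needed:

```
clang-repl> %lib libz.so.1
clang-repl> extern "C" const char* zlibVersion();
clang-repl> zlibVersion()
```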

Another part of this project is to add **re-optimization support** to Clang-Repl. Currently, Clang-Repl cannot optimize hot functions at runtime. With this feature, Clang-Repl will be able to detect frequently called functions and re-optimize them once a runtime call-count threshold is exceeded.

### Objectives

* Implement **auto-loading** of dynamic libraries in ORC JIT.
* Add **re-optimization support** to Clang-Repl for hot functions.
### Implementation Details and Plans

The primary objective of this project is to enable **automatic loading of dynamic libraries for unresolved symbols** in Clang-Repl. Since Clang-Repl relies heavily on LLVM's **ORC JIT** for incremental compilation and execution, our work focuses on extending ORC JIT to support this capability in out-of-process execution environments.

Currently, ORC JIT handles dynamic library symbol resolution through the `DynamicLibrarySearchGenerator`, which is registered for each loaded dynamic library. This generator is responsible for symbol lookup and interacts with the **Executor Process Control** layer to resolve symbols during execution. Specifically, it uses a `DylibHandle` to identify which dynamic library to search for the unresolved symbol. On the executor side, the `SimpleExecutorDylibManager` API performs the actual lookup using this handle.
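
For reference, this is roughly how such a generator gets attached today in the simpler in-process case; a minimal sketch assuming LLJIT and a caller-supplied library path (the EPC-based out-of-process variant works analogously):

```cpp
#include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
#include "llvm/ExecutionEngine/Orc/LLJIT.h"

using namespace llvm;
using namespace llvm::orc;

// Attach a generator so unresolved lookups in the main JITDylib fall back
// to searching the given shared library.
Error addLibrarySearch(LLJIT &J, const char *LibPath) {
  auto Gen = DynamicLibrarySearchGenerator::Load(
      LibPath, J.getDataLayout().getGlobalPrefix());
  if (!Gen)
    return Gen.takeError();
  J.getMainJITDylib().addGenerator(std::move(*Gen));
  return Error::success();
}
```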

To support **auto-loading in out-of-process execution**, Lang Hames proposed a design involving two new components (a sketch of the first follows the list):

* **`ExecutorResolver` API**: an abstract interface for resolving symbols on the executor side. It can be implemented in different ways, for example:

  * `PerDylibResolver`, which wraps a native handle for a specific library.
  * `AutoLoadDylibResolver`, which attempts to load libraries automatically when a symbol is unresolved.

  The `SimpleExecutorDylibManager` will be responsible for creating and managing these resolvers, returning a `ResolverHandle` instead of the traditional `DylibHandle`.

* **`ExecutorSymbolResolutionGenerator`**: a generator that replaces the existing `EPCDynamicLibrarySearchGenerator` for out-of-process execution. Unlike the previous design, which relied on `DylibHandle`, this generator will use the new `ResolverHandle` to resolve symbols via the `ResolverHandle->resolve()` interface.
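
A minimal sketch of what the executor-side interface could look like. None of this is upstream yet; every name and signature below is an assumption based on the design notes above:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Assumed result type: one executor address per requested symbol, 0 = absent.
struct ResolveResult {
  std::vector<uint64_t> Addrs;
};
using ResolveCallback = std::function<void(ResolveResult)>;

// Abstract resolver living in the executor process.
class ExecutorResolver {
public:
  virtual ~ExecutorResolver() = default;
  // Resolve a batch of symbol names and report back asynchronously.
  virtual void resolveAsync(const std::vector<std::string> &Symbols,
                            ResolveCallback OnResolved) = 0;
};

// Wraps the native handle of one already-loaded library.
class PerDylibResolver : public ExecutorResolver {
  void *DylibHandle; // e.g. obtained from dlopen()
public:
  explicit PerDylibResolver(void *H) : DylibHandle(H) {}
  void resolveAsync(const std::vector<std::string> &Symbols,
                    ResolveCallback OnResolved) override;
};

// Scans known library paths and loads libraries on demand.
class AutoLoadDylibResolver : public ExecutorResolver {
public:
  void resolveAsync(const std::vector<std::string> &Symbols,
                    ResolveCallback OnResolved) override;
};
```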

In out-of-process execution, **per-library lookup** requires an RPC call per dynamic library when resolving a symbol. If the symbol lives in the **(N-1)th** library, **N-1 RPC calls** are made, introducing significant overhead.
In **auto-loading mode**, only one RPC call is made, but it scans all libraries, which is also inefficient when the symbol is missing.

To reduce this overhead, we propose using a **Bloom filter** to quickly check symbol presence in both modes before making costly lookups. The main challenge lies in designing an efficient and accurate filtering approach.
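
A minimal sketch of the controller-side pre-check, assuming the executor ships one filter per library built from its exported symbol names (the filter size and hash count here are arbitrary):

```cpp
#include <bitset>
#include <functional>
#include <string>

class SymbolBloomFilter {
  static constexpr size_t NumBits = 1 << 20;
  std::bitset<NumBits> Bits;

  // Derive several hash values by perturbing the input with a seed.
  static size_t hash(const std::string &S, size_t Seed) {
    return std::hash<std::string>{}(S + static_cast<char>('0' + Seed)) % NumBits;
  }

public:
  void add(const std::string &Sym) {
    for (size_t I = 0; I < 3; ++I)
      Bits.set(hash(Sym, I));
  }

  // False means "definitely absent": the RPC for this library can be skipped.
  // True means "maybe present" (Bloom filters admit false positives).
  bool mayContain(const std::string &Sym) const {
    for (size_t I = 0; I < 3; ++I)
      if (!Bits.test(hash(Sym, I)))
        return false;
    return true;
  }
};
```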

The second goal of this project is to add **re-optimization support** to Clang-Repl. Since ORC JIT is the core component Clang-Repl uses for runtime compilation and execution, we will build on its existing capabilities. ORC JIT supports runtime re-optimization through the `ReOptimizeLayer` and `RedirectableManager`.

At a high level, the `ReOptimizeLayer` emits boilerplate "sugar" code into the IR module. This code triggers a call to `__orc_rt_reoptimize_tag` when a threshold count is exceeded. The call is handled by `ReOptimizeLayer::rt_reoptimize`, which the ORC runtime invokes to generate an optimized version of a "hot" function. The `RedirectableManager` then updates the function's stub pointer to point to the new optimized version. To achieve this, we will implement a custom `ReOptFunc`. If runtime profiling is needed to detect hot functions, we may also need small changes to the ORC runtime to collect this data.
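
Conceptually, the injected instrumentation amounts to a per-function call counter. The real layer rewrites LLVM IR rather than C++, and the threshold and names below are illustrative only:

```cpp
static constexpr unsigned ReoptThreshold = 1000;

// Illustrative stand-in for the runtime upcall, which the ORC runtime
// forwards to ReOptimizeLayer::rt_reoptimize on the controller side.
extern void orc_rt_reoptimize(void *FnTag);

double hot_fn(double x) {
  static unsigned CallCount = 0;
  if (++CallCount == ReoptThreshold)
    orc_rt_reoptimize((void *)&hot_fn); // request an optimized version
  return x * x;                         // original function body
}
```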

### Conclusion

Upon completion of this project, ORC JIT will gain the ability to **automatically load dynamic libraries** to resolve previously unresolved symbols. Additionally, the integration of **filter-based optimizations** on the controller side will significantly reduce the overhead of unnecessary RPC calls.
Overall, this work enhances the flexibility and performance of ORC JIT and improves the user experience in tools like Clang-Repl that rely on it.

### Related Links

- [LLVM Repository](https://github.com/llvm/llvm-project)
- [Project Description](https://discourse.llvm.org/t/gsoc2025-advanced-symbol-resolution-and-reoptimization-for-clang-repl/84624/3)
- [My GitHub Profile](https://github.com/SahilPatidar)
Lines changed: 50 additions & 0 deletions

---
title: "Supporting STL Concurrency Primitives in CLAD"
layout: post
excerpt: "Support for STL concurrency features in CLAD is a useful feature for applications utilizing CPU threads. Many applications of automatic differentiation benefit from parallel or concurrent processing, and support for some STL concurrency primitives such as threads and basic synchronization primitives may considerably simplify the user's design."
sitemap: false
author: Petro Mozil
permalink: blogs/gsoc25_/
banner_image: /images/blog/gsoc-banner.png
date: 2025-05-18
tags: gsoc llvm clang auto-differentiation
---

## About me

I am Petro Mozil, a student participating in the Google Summer of Code program in 2025.
I will work on adding support for STL concurrency primitives to CLAD.

## Problem description

`Clad` is a plugin for automatic differentiation for the `clang` compiler.
Automatic differentiation is a family of techniques for computing the derivatives of functions written as code, by applying differentiation rules to the program itself; it differs from approximating a derivative numerically and from manipulating a formula symbolically.

`Clad` provides an interface that returns an object containing the derivative of a given function. Some functions, however, are problematic to differentiate. For example, one would not differentiate `printf`, and neither would one differentiate `std::thread` - those are exceptions and should be handled differently from mathematical functions.

The main goal of this project is to support automatic differentiation of functions that contain `std::thread`, so that users don't have to separate the multithreading logic from the mathematical functions - such a feature would be a great time-saver when writing multithreaded numerical code. A usage sketch of the end goal follows.
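
This sketch shows what the project is working towards; it does not work in Clad today, and the function being differentiated is purely illustrative:

```cpp
#include "clad/Differentiator/Differentiator.h"
#include <thread>

// The mathematical core runs on a worker thread.
double f(double x) {
  double result = 0.0;
  std::thread worker([&] { result = x * x * x; });
  worker.join();
  return result;
}

int main() {
  // Goal: clad::gradient differentiates the function the thread runs,
  // while treating std::thread itself as non-differentiable plumbing.
  auto df = clad::gradient(f);
  double dx = 0.0;
  df.execute(2.0, &dx); // expected: dx == 12.0, since d/dx x^3 = 3x^2
}
```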

## Objectives

The objectives for this project include adding support for several STL types, such as `std::thread`, `std::atomic`, and `std::mutex`.

The first and likely most important part of the project is support for `std::thread` - this involves differentiating not the `std::thread` constructor, but the function supplied to the thread.

Support for mutexes is more straightforward - though `clad` normally creates a second object to represent the derivative value, it shouldn't do so for a mutex. It is a matter of providing a custom derivative for `std::mutex`, roughly as sketched below.
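
A minimal sketch of that idea, using Clad's custom-derivative mechanism on a free wrapper function. The exact hook for member functions like `std::mutex::lock` may differ, and the wrapper and pullback names here are illustrative, not Clad's actual API for this case:

```cpp
#include "clad/Differentiator/Differentiator.h"
#include <mutex>

// Differentiated code would call this wrapper instead of m.lock() directly.
void locked_region_enter(std::mutex &m) { m.lock(); }

namespace clad::custom_derivatives {
// Pullback for the wrapper: replay the lock in the reverse pass and
// propagate no adjoint, since a mutex carries no derivative value and
// must not receive a shadow object.
void locked_region_enter_pullback(std::mutex &m, std::mutex *d_m) {
  m.lock();
}
} // namespace clad::custom_derivatives
```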

Atomics will likely involve more effort - they would require custom derivatives for the `compare_exchange` family of functions as well as for their other methods.

If time allows, I would also like to add support for `std::condition_variable`, `std::lock_guard`, `std::unique_lock`, and `std::jthread`; most of these would likewise only require a custom derivative.

## Conclusion

As a result of this project, Clad is expected to support the main STL concurrency primitives and to seamlessly differentiate functions that use them.
Though this project does not focus on features immediately required from `clad`, it should make life easier for those who use clad for high-performance computing.

## Related links

- [LLVM Repository](https://github.com/llvm/llvm-project)
- [CLAD repository](https://github.com/vgvassilev/clad)
- [Project description](https://hepsoftwarefoundation.org/gsoc/2025/proposal_Clad-STLConcurrency.html)
- [My GitHub](https://github.com/pmozil)
Lines changed: 52 additions & 0 deletions

---
title: "Enhancing LLM Training Efficiency with Clad for Automatic Differentiation"
layout: post
excerpt: "This GSoC project leverages Clad to optimize LLM training in C++, aiming to boost efficiency by developing a custom tensor library and integrating Clad for compiler-level gradient calculations."
sitemap: true
author: Rohan Timmaraju
permalink: blogs/gsoc25_rohan_introduction_blog/
banner_image: /images/blog/LLM_project_banner.jpg
date: 2025-05-21
tags: gsoc c++ clang clad llm
---
### Introduction

I am Rohan Timmaraju, a Computer Science student at Columbia University. During Google Summer of Code 2025, I will be working on the "Enhancing LLM Training Efficiency with Clad for Automatic Differentiation" project with the Compiler Research group.

**Mentors**: Vassil Vassilev, David Lange, Jonas Rembser, Christina Koutsou

### About LLM Training

Large Language Models (LLMs) like ChatGPT have revolutionized AI, but their training is incredibly computationally intensive. Currently, Python-based frameworks such as PyTorch and TensorFlow are the go-to tools. While they offer excellent flexibility and a rich ecosystem, their reliance on interpreted execution and dynamic computation graphs can lead to performance bottlenecks and high memory consumption. This is particularly noticeable when we consider deploying or training these models in resource-constrained environments or within C++-centric high-performance computing (HPC) setups, which are common in scientific research.

While C++ provides the tools for fine-grained control over system resources and has proven its capabilities in efficient LLM inference (as seen with projects like [llama.cpp](https://github.com/ggml-org/llama.cpp)), the critical component for *training* – flexible and efficient Automatic Differentiation (AD) – presents an ongoing challenge for C++ solutions.

### Why Use Clad?

This project proposes to tackle this challenge by integrating Clad, an Automatic Differentiation plugin for the Clang compiler. Unlike traditional AD libraries that often operate at runtime, Clad performs source-to-source transformation. It analyzes the C++ Abstract Syntax Tree (AST) at compile time and generates optimized C++ code for computing derivatives. This compiler-level approach has the potential to reduce runtime overhead and improve memory efficiency compared to dynamic methods.
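
A small, self-contained example of standard Clad usage (the loss function here is ours, not part of the project's codebase):

```cpp
#include "clad/Differentiator/Differentiator.h"
#include <cstdio>

// Squared error of a one-parameter linear model on a single data point.
double loss(double w, double b) {
  double err = w * 3.0 + b - 2.0;
  return err * err;
}

int main() {
  // Clad generates the gradient function at compile time.
  auto dloss = clad::gradient(loss);
  double dw = 0.0, db = 0.0;
  dloss.execute(1.0, 1.0, &dw, &db);
  std::printf("dL/dw = %f, dL/db = %f\n", dw, db); // 12 and 4
}
```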

To facilitate this integration, I am developing a custom C++ tensor library to be used in neural network training. Inspired by the powerful approaches of libraries such as [llm.c](https://github.com/karpathy/llm.c) and [PyTorch](https://docs.pytorch.org/cppdocs/), this library is being designed from the ground up with Clad compatibility in mind. The core idea is to replace manual or internally managed gradient computations with Clad's reverse-mode AD (as in `clad::gradient`) for key LLM operations like matrix multiplications, activation functions, normalization layers, and the final loss function.
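
As a sketch of what "designed with Clad compatibility in mind" might mean in practice: flat buffers and plain loops that Clad's reverse mode can see through. The kernel below is an assumption about the library's eventual shape, not its actual API, and the derivative-parameter convention varies across Clad releases:

```cpp
#include "clad/Differentiator/Differentiator.h"

// Mean squared error over flat, fixed-layout buffers.
double mse(const double *pred, const double *target, int n) {
  double s = 0.0;
  for (int i = 0; i < n; ++i) {
    double d = pred[i] - target[i];
    s += d * d;
  }
  return s / n;
}

int main() {
  double pred[3] = {1.0, 2.0, 3.0};
  double target[3] = {1.0, 0.0, 1.0};
  double dpred[3] = {0.0, 0.0, 0.0};
  // Differentiate with respect to the prediction buffer only.
  auto dmse = clad::gradient(mse, "pred");
  dmse.execute(pred, target, 3, dpred);
}
```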

### Implementation Plan

1. **Foundation & Baseline:** We will start by implementing a complete GPT-2 training loop in C++ *without* Clad, which will serve as our performance baseline. GPT-2 is chosen as a relatively simple open-source LLM architecture capable of being trained on local devices; the work could later be extended to other architectures like Llama or Mistral.
2. **Core Clad Integration Strategy:** We will investigate and evaluate different strategies for applying Clad to tensor gradient calculations, also identifying areas where Clad itself could be enhanced for deep learning workloads.
3. **Expanding Integration:** Once a promising strategy is identified and validated on simpler operations, we'll systematically integrate Clad into more complex components of the GPT-2 architecture.
4. **Benchmarking & Optimization:** Benchmarking against our baseline will be crucial to quantify the performance gains in speed and memory. We'll also use profiling tools to identify bottlenecks and optimize the tensor library with Clad. OpenMP may be employed to further boost performance through parallelization (see the sketch after this list).
5. **Documentation & Potential Extensions:** Thorough documentation of the tensor library, the Clad integration process, and our findings will also be a primary focus. Time permitting, we'll explore extending this work to other LLM architectures like Llama.
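
A sketch of the kind of OpenMP parallelization item 4 refers to, applied to a plain matmul kernel (the loop structure is illustrative, not the library's final implementation):

```cpp
#include <omp.h>

// C = A (M x K) * B (K x N); rows of C are computed in parallel.
void matmul(const float *A, const float *B, float *C, int M, int N, int K) {
#pragma omp parallel for
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = acc;
    }
}
```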

### Conclusion

By successfully integrating Clad into a C++ LLM training pipeline, we aim to:
* **Demonstrate Performance Gains:** Show tangible improvements in training speed and memory efficiency.
* **Clad for ML:** Provide a significant real-world use case, potentially identifying areas for improving Clad's support for ML tasks.
* **Offer a C++ Alternative:** Provide a foundation for more efficient, compiler-driven LLM training within the C++ ecosystem.
* **Learn and Share:** Gain insights into the practicalities of applying compiler-based AD to complex ML problems and share these learnings with the community.

I believe this project has the potential to make a valuable contribution both to compiler research and to the ongoing effort to make powerful AI models more accessible and efficient to train.

### Related Links

- [Project Description](https://hepsoftwarefoundation.org/gsoc/2025/proposal_Clad-LLM.html)
- [Clad Repository](https://github.com/vgvassilev/clad)
- [My GitHub Profile](https://github.com/Rohan-T144)
Binary file (187 KB) not shown.

Binary file (171 KB) not shown.

images/blog/LLM_project_banner.jpg (354 KB)

images/blog/gsoc_clang_repl.jpeg (99 KB)

images/team/PetroMozil.jpg (107 KB)

images/team/Rohan_Timmaraju.jpg (294 KB)
