diff --git a/readme.md b/readme.md index 87979f5..211a7fb 100644 --- a/readme.md +++ b/readme.md @@ -1,5 +1,11 @@ # LLM Papers We Recommend to Read +## ๐ŸŒ Language / ์–ธ์–ด +- **English** (Current) +- **[ํ•œ๊ตญ์–ด (Korean)](./translation/ko/readme_ko.md)** + +--- + The past several years has marked the steady rise of large language models (LLMs), largely driven by advancements in computational power, data availability, and algorithmic innovation. LLMs have profoundly shaped the research landscape, introducing new methodologies and paradigms that challenge traditional approaches. We have also expanded our research interests to the field of LLMs. Here are some research papers related to LLMs. We highly recommend beginners to read and thoroughly understand these papers. diff --git a/translation/ko/moe_related_ko.md b/translation/ko/moe_related_ko.md new file mode 100644 index 0000000..3192d25 --- /dev/null +++ b/translation/ko/moe_related_ko.md @@ -0,0 +1,38 @@ +## MoE ์ถ”๋ก  ์ตœ์ ํ™” + +| ์ œ๋ชฉ | ๋งํฌ | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | [[paper]](http://arxiv.org/abs/2308.12066) | +| Fast Inference of Mixture-of-Experts Language Models with Offloading | [[paper]](http://arxiv.org/abs/2312.17238) | +| MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving | [[paper]](http://arxiv.org/abs/2401.14361) | +| Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models | [[paper]](http://arxiv.org/abs/2402.07033) | +| Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference | [[paper]](http://arxiv.org/abs/2401.08383) | +| SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | [[paper]](http://arxiv.org/abs/2310.18859) | +| SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models via Dynamic Expert Pruning and Swapping | [[paper]](http://arxiv.org/abs/2308.15030) | +| Accelerating Distributed MoE Training and Inference with Lina | [[paper]](https://www.usenix.org/conference/atc23/presentation/li-jiamin) | +| Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference | [[paper]](http://arxiv.org/abs/2303.06182) | +| EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | [[paper]](http://arxiv.org/abs/2308.14352) | +| AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference | [[paper]](http://arxiv.org/abs/2408.10284) | +| ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference | [[paper]](http://arxiv.org/abs/2410.17954) | +| ProMoE: Fast MoE-based LLM Serving using Proactive Caching | [[paper]](http://arxiv.org/abs/2410.22134) | +| HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference | [[paper]](http://arxiv.org/abs/2411.01433) | +| Toward Efficient Inference for Mixture of Experts | [[paper]](https://proceedings.neurips.cc/paper_files/paper/2024/hash/98bf3b8505c611ac21055dd9d355c66e-Abstract-Conference.html) | +| A Survey on Inference Optimization Techniques for Mixture of Experts Models | [[paper]](http://arxiv.org/abs/2412.14219) | +| MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services | [[paper]](http://arxiv.org/abs/2205.10034) | +| EPS-MoE: Expert Pipeline 
Scheduler for Cost-Efficient MoE Inference | [[paper]](http://arxiv.org/abs/2410.12247) | +| fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving | [[paper]](http://arxiv.org/abs/2502.05370) | +| MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing | [[paper]](http://arxiv.org/abs/2502.06643) | +| Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline | [[paper]](http://arxiv.org/abs/2502.06888) | +| Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing | [[paper]](http://arxiv.org/abs/2501.05313) | +| DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference | [[paper]](http://arxiv.org/abs/2501.10375) | +| Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | [[paper]](http://arxiv.org/abs/2502.19811) | +| Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion | [[paper]](https://dl.acm.org/doi/10.1145/3710848.3710868) | +| CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory | [[paper]](http://arxiv.org/abs/2503.02354) | +| eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference | [[paper]](http://arxiv.org/abs/2503.06823) | +| Accelerating MoE Model Inference with Expert Sharding | [[paper]](http://arxiv.org/abs/2503.08467) | +| Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | [[paper]](http://arxiv.org/abs/2503.10725) | +| MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching | [[paper]](http://arxiv.org/abs/2503.09716) | +| MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | [[paper]](https://arxiv.org/abs/2504.02263) | +| D$^2$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving | [[paper]](http://arxiv.org/abs/2504.15299) | +| Faster MoE LLM Inference for Extremely Large Models | [[paper]](http://arxiv.org/abs/2505.03531) | +| Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony | [[paper]](http://arxiv.org/abs/2505.08944) | \ No newline at end of file diff --git a/translation/ko/paper.md b/translation/ko/paper.md new file mode 100644 index 0000000..2eee37d --- /dev/null +++ b/translation/ko/paper.md @@ -0,0 +1,2 @@ +# paper + diff --git a/translation/ko/readme_ko.md b/translation/ko/readme_ko.md new file mode 100644 index 0000000..da7e7f5 --- /dev/null +++ b/translation/ko/readme_ko.md @@ -0,0 +1,165 @@ +# ์šฐ๋ฆฌ๊ฐ€ ์ฝ๊ธฐ๋ฅผ ์ถ”์ฒœํ•˜๋Š” LLM ๋…ผ๋ฌธ๋“ค + +## ๐ŸŒ Language / ์–ธ์–ด +- **[English](../../readme.md)** +- **ํ•œ๊ตญ์–ด (Korean)** (Current) + +--- + +์ตœ๊ทผ ๋ช‡ ๋…„ ๋™์•ˆ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(LLMs)์ด ๊พธ์ค€ํžˆ ๋ฐœ์ „ํ•ด์™”๊ณ  ์ด๋Š” ๊ณ„์‚ฐ ๋Šฅ๋ ฅ, ๋ฐ์ดํ„ฐ ๊ฐ€์šฉ์„ฑ, ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ˜์‹  ๋•๋ถ„์ž…๋‹ˆ๋‹ค. LLM๋“ค์€ ์—ฐ๊ตฌ ํ™˜๊ฒฝ์„ ๊ทผ๋ณธ์ ์œผ๋กœ ๋ณ€ํ™”์‹œ์ผฐ๊ณ , ์ „ํ†ต์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์— ๋„์ „ํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•๋ก ๊ณผ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ๋„์ž…ํ–ˆ์Šต๋‹ˆ๋‹ค. + +์ €ํฌ๋„ ์ด๋Ÿฐ ํ๋ฆ„์— ๋”ฐ๋ผ ์—ฐ๊ตฌ ๊ด€์‹ฌ์‚ฌ๋ฅผ LLM ๋ถ„์•ผ๋กœ ํ™•์žฅํ–ˆ์Šต๋‹ˆ๋‹ค. ์ดํ•˜ ์ œ์‹œ๋˜๋Š” ๋‚ด์šฉ๋“ค์€ LLM๊ณผ ๊ด€๋ จ๋œ ๋ช‡ ๊ฐ€์ง€ ์—ฐ๊ตฌ ๋…ผ๋ฌธ๋“ค์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ดˆ๋ณด์ž๋“ค์ด ์ด๋Ÿฌํ•œ ๋…ผ๋ฌธ๋“ค์„ ์ฝ๊ณ  ์ฒ ์ €ํžˆ ์ดํ•ดํ•˜๊ธฐ๋ฅผ ๊ฐ•๋ ฅํžˆ ์ถ”์ฒœํ•ฉ๋‹ˆ๋‹ค. 
+ +:smile: **์šฐ๋ฆฌ๋Š” ๋ชจ๋“  ๊ธฐ์—ฌ๋ฅผ ํ™˜์˜ํ•˜๊ณ  ์†Œ์ค‘ํžˆ ์—ฌ๊น๋‹ˆ๋‹ค.** + +## LLM์˜ ๊ธฐ๋ณธ ๊ตฌ์กฐ(Basic Architectures of LLMs) + +| ์ œ๋ชฉ | ๋งํฌ | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| Sequence to Sequence Learning with Neural Networks | [[paper]](https://arxiv.org/abs/1409.3215) | +| Transformer: Attention Is All You Need | [[paper]](http://arxiv.org/abs/1706.03762) | +| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | [[paper]](https://arxiv.org/abs/1810.04805) | +| GPT: Improving Language Understanding by Generative Pre-Training | [[paper]](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) | +| GPT2: Language Models are Unsupervised Multitask Learners | [[paper]](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) | +| GPT3: Language Models are Few-Shot Learners | [[paper]](https://arxiv.org/abs/2005.14165) | +| GPT3.5: Fine-Tuning Language Models from Human Preferences | [[paper]](https://arxiv.org/abs/1909.08593) | +| LLaMA: Open and Efficient Foundation Language Models | [[paper]](http://arxiv.org/abs/2302.13971) | +| Llama 2: Open Foundation and Fine-Tuned Chat Models | [[paper]](http://arxiv.org/abs/2307.09288) | +| Qwen2.5-1M Technical Report | [[paper]](https://arxiv.org/abs/2501.15383) | + +> **์ดํ•˜์˜ ๋ชจ๋“  ๋…ผ๋ฌธ์„ ์ฝ๊ธฐ ์ „์— ๊ธฐ๋ณธ ๊ตฌ์กฐ์— ๋Œ€ํ•œ ๋…ผ๋ฌธ์„ ์ฝ๋Š”๊ฒƒ์„ ๊ฐ•๋ ฅํžˆ ๊ถŒ๊ณ ํ•ฉ๋‹ˆ๋‹ค.** + +### ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(Multimodal Large Language Models) + +| ์ œ๋ชฉ | ๋งํฌ | +| ------------------------------------------------------------ | ------------------------------------------- | +| Efficient Multimodal Large Language Models: A Survey | [[paper]](http://arxiv.org/abs/2405.10739) | +| CLIP: Learning Transferable Visual Models From Natural Language Supervision | [[paper]](https://arxiv.org/abs/2103.00020) | +| Seed1.5-VL Technical Report | [[paper]](http://arxiv.org/abs/2505.07062) | +| MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining | [[paper]](http://arxiv.org/abs/2505.07608) | + +> ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ์ด๋ž€ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ์œ ํ˜•์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. 
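위의 CLIP처럼 멀티모달 모델은 서로 다른 모달리티(예: 이미지와 텍스트)를 하나의 공통 임베딩 공간으로 사상한 뒤 유사도를 비교하는 경우가 많습니다. 아래는 이 아이디어만 보여 주는 최소한의 토이 스케치입니다. 학습된 인코더 대신 임의의 선형 사상(`W_img`, `W_txt`)을 가정했으며, 실제 CLIP 구현이 아닙니다.

```python
# 가정: 학습된 인코더 대신 무작위 선형 사상을 쓰는 토이 예시 (numpy 필요)
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_joint = 16, 12, 8          # 임의로 정한 차원들

W_img = rng.normal(size=(d_joint, d_img))  # (가상의) 이미지 인코더
W_txt = rng.normal(size=(d_joint, d_txt))  # (가상의) 텍스트 인코더

def embed(x, W):
    z = W @ x
    return z / np.linalg.norm(z)           # 단위 벡터로 정규화 → 내적 = 코사인 유사도

image_feat = rng.normal(size=d_img)        # 이미지 특징 벡터(가정)
text_feats = rng.normal(size=(3, d_txt))   # 후보 텍스트 3개의 특징 벡터(가정)

z_img = embed(image_feat, W_img)
sims = [float(z_img @ embed(t, W_txt)) for t in text_feats]
best = int(np.argmax(sims))                # 유사도가 가장 높은 텍스트를 이미지와 매칭
print(sims, best)
```

실제 CLIP은 이런 유사도가 맞는 (이미지, 캡션) 쌍에서 높아지도록 대규모 대조 학습(contrastive learning)으로 두 인코더를 함께 학습합니다.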
+ +## ๋ณ‘๋ ฌ ํ›ˆ๋ จ ์‹œ์Šคํ…œ(Parallelism Training System) + +| ์ œ๋ชฉ | ๋งํฌ | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | [[paper]](http://arxiv.org/abs/1909.08053) | +| ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | [[paper]](http://arxiv.org/abs/1910.02054) | +| ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | [[paper]](http://arxiv.org/abs/2104.07857) | +| ZeRO-Offload: Democratizing Billion-Scale Model Training | [[paper]](https://www.usenix.org/conference/atc21/presentation/ren-jie) | +| PipeDream: generalized pipeline parallelism for DNN training | [[paper]](https://arxiv.org/abs/1806.03377) | +| GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | [[paper]](http://arxiv.org/abs/1811.06965) | +| TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | [[paper]](http://arxiv.org/abs/2102.07988) | +| GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | [[paper]](http://arxiv.org/abs/2006.16668) | +| PanGu-$\Sigma$: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing | [[paper]](http://arxiv.org/abs/2303.10845) | +| DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | [[paper]](https://proceedings.mlr.press/v162/rajbhandari22a.html) | +| Accelerating Distributed MoE Training and Inference with Lina | [[paper]](https://www.usenix.org/conference/atc23/presentation/li-jiamin) | +| Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism | [[paper]](http://arxiv.org/abs/2211.13878) | +| Alpa: Automating Inter- and {Intra-Operator} Parallelism for Distributed Deep Learning | [[paper]](https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin) | +| Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs | [[paper]](http://arxiv.org/abs/2505.04519) | + +> ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ ํ›ˆ๋ จ ์‹œ์Šคํ…œ(Parallelism Training System)์ด๋ž€ ๋Œ€๊ทœ๋ชจ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ(ํŠนํžˆ LLM)์„ ์—ฌ๋Ÿฌ ๋Œ€์˜ GPU, TPU, ๋˜๋Š” ์ปดํ“จํŠธ ๋…ธ๋“œ์— ๋ถ„์‚ฐ์‹œ์ผœ ๋™์‹œ์— ํ•™์Šต์‹œํ‚ค๋Š” ๊ธฐ์ˆ  ๋ฐ ์‹œ์Šคํ…œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. + +## LLM ์„œ๋น™ ์‹œ์Šคํ…œ(LLM Serving System) + +LLM ์„œ๋น™ ์ž‘์—…์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ถ„์•ผ๋กœ ๋ถ„๋ฅ˜๋  ์ˆ˜ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค: *์‹œ์Šคํ…œ ์ตœ์ ํ™”* (์˜ˆ: vLLM), *์Šค์ผ€์ค„๋ง ์ตœ์ ํ™”* (์˜ˆ: DistServe, Llumnix), *์˜คํ”„๋กœ๋”ฉ* (์˜ˆ: FlexGen), *์ ‘๋‘์‚ฌ ๊ณต์œ *, *KV ์บ์‹œ ์••์ถ•/์ œ๊ฑฐ/์„ ํƒ*, ๊ทธ๋ฆฌ๊ณ  *์ถ”์ธก์  ๋””์ฝ”๋”ฉ*. + +ํ–ฅํ›„ ์‹ค์ œ ๋ถ„๋ฅ˜๋ฅผ ์ง„ํ–‰ํ•  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค. 
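위에서 언급한 범주 중 KV 캐시가 왜 중요한지 감을 잡을 수 있도록, 디코딩 단계에서 이전 토큰들의 key/value를 저장해 두고 재사용하는 아이디어만 담은 토이 스케치를 덧붙입니다(이어서 서빙 관련 논문 목록이 나옵니다). 투영 행렬 없이 은닉 상태를 그대로 q, k, v로 쓰는 등 모든 수치는 임의의 가정이며, 실제 서빙 시스템 구현이 아닙니다.

```python
# 가정: 단일 헤드, 단일 쿼리로 단순화한 어텐션과 KV 캐시 (numpy 필요)
import numpy as np

d = 8                                  # 은닉 차원(임의의 값)
rng = np.random.default_rng(0)

def attention(q, K, V):
    scores = K @ q / np.sqrt(d)        # 스케일드 닷-프로덕트
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax
    return w @ V

K_cache, V_cache = [], []              # 디코딩 내내 재사용되는 KV 캐시

for step in range(4):                  # 토큰을 하나씩 생성하는 디코드 단계
    x = rng.normal(size=d)             # 새 토큰의 은닉 상태(가정)
    q, k, v = x, x, x                  # 실제로는 W_q, W_k, W_v 투영을 거침(여기서는 생략)
    K_cache.append(k)
    V_cache.append(v)                  # 새 토큰의 K/V만 추가로 저장
    out = attention(q, np.stack(K_cache), np.stack(V_cache))
    # 캐시 덕분에 이전 토큰들의 K/V를 매 스텝 다시 계산하지 않습니다.
    # vLLM(PagedAttention)은 바로 이 캐시를 페이지 단위로 관리해 메모리 낭비를 줄입니다.
```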
+ +| ์ œ๋ชฉ | ๋งํฌ | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | [[paper]](http://arxiv.org/abs/2312.15234) | +| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | [[paper]](https://arxiv.org/abs/2205.14135) | +| Efficiently Scaling Transformer Inference | [[paper]](http://arxiv.org/abs/2211.05102) | +| vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention | [[paper]](http://arxiv.org/abs/2309.06180) | +| DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | [[paper]](https://arxiv.org/abs/2207.00032) | +| Orca: A Distributed Serving System for Transformer-Based Generative Models | [[paper]](https://www.usenix.org/conference/osdi22/presentation/yu) | +| FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | [[paper]](http://arxiv.org/abs/2303.06865) | +| S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput | [[paper]](http://arxiv.org/abs/2306.06000) | +| Splitwise: Efficient generative LLM inference using phase splitting | [[paper]](http://arxiv.org/abs/2311.18677) | +| SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [[paper]](https://arxiv.org/abs/2305.09781) | +| Petals: Collaborative Inference and Fine-tuning of Large Models | [[paper]](https://aclanthology.org/2023.acl-demo.54) | +| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | [[paper]](http://arxiv.org/abs/2312.12456) | +| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | [[paper]](http://arxiv.org/abs/2401.09670) | +| LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | [[paper]](http://arxiv.org/abs/2404.09526) | +| Vidur: A Large-Scale Simulation Framework For LLM Inference | [[paper]](http://arxiv.org/abs/2405.05465) | +| Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers | [[paper]](http://arxiv.org/abs/2405.10480) | +| AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | [[paper]](http://arxiv.org/abs/2302.11665) | +| SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | [[paper]](http://arxiv.org/abs/2308.16369) | +| Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | [[paper]](http://arxiv.org/abs/2403.02310) | +| Llumnix: Dynamic Scheduling for Large Language Model Serving | [[paper]](http://arxiv.org/abs/2406.03243) | +| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | [[paper]](http://arxiv.org/abs/2407.00079) | +| InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | [[paper]](http://arxiv.org/abs/2406.19707) | +| ServerlessLLM: Low-Latency Serverless Inference for Large Language Models | [[paper]](https://www.usenix.org/conference/osdi24/presentation/fu) | +| Is the GPU Half-Empty or Half-Full? 
Practical Scheduling Techniques for LLMs | [[paper]](http://arxiv.org/abs/2410.17840) | +| NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | [[paper]](http://arxiv.org/abs/2411.01142) | +| EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving | [[paper]](http://arxiv.org/abs/2411.06364) | +| Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation | [[paper]](http://arxiv.org/abs/2503.20552) | +| semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage | [[paper]](http://arxiv.org/abs/2504.19867) | + +> LLM Serving System์€ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(LLM)์„ **์‹ค์ œ ์„œ๋น„์Šค ํ™˜๊ฒฝ์—์„œ ์‚ฌ์šฉ์ž ์š”์ฒญ์— ๋”ฐ๋ผ ๋น ๋ฅด๊ณ  ํšจ์œจ์ ์œผ๋กœ ์‘๋‹ตํ•˜๋„๋ก ๋ฐฐํฌยท์šด์˜ํ•˜๋Š” ์‹œ์Šคํ…œ์„ ์˜๋ฏธ**ํ•ฉ๋‹ˆ๋‹ค. + +### ๋‹ค์ค‘ LoRA๋ฅผ ์‚ฌ์šฉํ•œ LLM ์„œ๋น™(Serving LLMs with Multiple LoRAs) + +| ์ œ๋ชฉ | ๋งํฌ | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| PetS: A Unified Framework for Parameter-Efficient Transformers Serving | [[paper]](https://www.usenix.org/conference/atc22/presentation/zhou-zhe) | +| Punica: Multi-Tenant LoRA Serving | [[paper]](http://arxiv.org/abs/2310.18547) | +| S-LoRA: Serving Thousands of Concurrent LoRA Adapters | [[paper]](http://arxiv.org/abs/2311.03285) | +| dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | [[paper]](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang) | + +> LoRA(Low-Rank Adaptation)๋Š” ๋Œ€ํ˜• ์–ธ์–ด ๋ชจ๋ธ(LLM)์„ ํšจ์œจ์ ์œผ๋กœ ๋ฏธ์„ธ์กฐ์ •(fine-tuning)ํ•˜๊ธฐ ์œ„ํ•œ ๋Œ€ํ‘œ์ ์ธ ์ตœ์‹  ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. + +## ๋งค๊ฐœ๋ณ€์ˆ˜ ํšจ์œจ์  ๋ฏธ์„ธ ์กฐ์ • (Parameter-Efficient Fine-Tuning, PEFT) + +| ์ œ๋ชฉ | ๋งํฌ | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models | [[paper]](http://arxiv.org/abs/2203.06904) | +| Parameter-Efficient Transfer Learning for NLP | [[paper]](https://proceedings.mlr.press/v97/houlsby19a.html) | +| Prefix-Tuning: Optimizing Continuous Prompts for Generation | [[paper]](http://arxiv.org/abs/2101.00190) | +| LoRA: Low-Rank Adaptation of Large Language Models | [[paper]](http://arxiv.org/abs/2106.09685) | +| Towards a Unified View of Parameter-Efficient Transfer Learning | [[paper]](http://arxiv.org/abs/2110.04366) | +| Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning | [[paper]](http://arxiv.org/abs/2303.10512) | +| When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method | [[paper]](http://arxiv.org/abs/2402.17193) | +| Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning | [[paper]](http://arxiv.org/abs/2409.01035) | +| DoRA: Weight-Decomposed Low-Rank Adaptation | [[paper]](http://arxiv.org/abs/2402.09353) | +| GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection | [[paper]](http://arxiv.org/abs/2403.03507) | + +> Parameter-Efficient Fine-Tuning(PEFT)์€ ๊ธฐ์กด์˜ ์ „์ฒด ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ชจ๋‘ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐฉ์‹(Full Fine-Tuning)๊ณผ ๋‹ฌ๋ฆฌ ๋ชจ๋ธ์˜ ์ผ๋ถ€ ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ ์„ ํƒ์ ์œผ๋กœ ํ•™์Šตํ•˜๊ฑฐ๋‚˜ ์†Œ๊ทœ๋ชจ์˜ ์ถ”๊ฐ€ ํŒŒ๋ผ๋ฏธํ„ฐ(์˜ˆ: ์–ด๋Œ‘ํ„ฐ, LoRA ๋“ฑ)๋ฅผ ๋„์ž…ํ•ด ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. 
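위 설명(특히 LoRA)을 구체적으로 보여 주기 위해, 사전학습 가중치 `W`는 고정한 채 저랭크 행렬 `A`, `B`만 추가로 두는 LoRA의 핵심 수식 `y = Wx + (α/r)·BAx`를 numpy로 옮긴 최소 스케치를 덧붙입니다. 차원과 초기화 값은 임의의 가정이며, 실제 학습 루프나 특정 라이브러리 구현이 아닙니다.

```python
# 가정: LoRA의 순전파 형태만 보여 주는 토이 스케치 (numpy 필요)
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16        # r << d_in, d_out

W = rng.normal(size=(d_out, d_in))           # 고정(frozen)된 사전학습 가중치
A = rng.normal(size=(r, d_in)) * 0.01        # 학습 대상 저랭크 행렬
B = np.zeros((d_out, r))                     # 0으로 초기화 → 학습 시작 시 출력은 W와 동일

def lora_forward(x):
    # 전체 W를 갱신하는 대신 r * (d_in + d_out)개의 파라미터만 학습합니다.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)

# 배포 시에는 W' = W + (alpha / r) * (B @ A)로 병합해 추가 지연 없이 서빙할 수 있고,
# 위의 Punica / S-LoRA처럼 하나의 W 위에 여러 (A, B) 어댑터를 얹어 동시에 서빙할 수도 있습니다.
```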
+
+## 압축 (양자화, 희소성) : Compression (Quantization, Sparsity)
+
+| 제목 | 링크 |
+| ------------------------------------------------------------ | ------------------------------------------- |
+| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | [[paper]](http://arxiv.org/abs/2208.07339) |
+| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | [[paper]](http://arxiv.org/abs/2211.10438) |
+| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | [[paper]](https://arxiv.org/abs/2210.17323) |
+| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | [[paper]](http://arxiv.org/abs/2306.00978) |
+| QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | [[paper]](http://arxiv.org/abs/2405.04532) |
+| Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | [[paper]](https://arxiv.org/abs/2310.17157) |
+| Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | [[paper]](http://arxiv.org/abs/2310.19102) |
+| QLoRA: Efficient Finetuning of Quantized LLMs | [[paper]](http://arxiv.org/abs/2305.14314) |
+| QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | [[paper]](http://arxiv.org/abs/2309.14717) |
+
+> 양자화란 모델의 파라미터를 더 적은 비트 수로 표현하여 메모리 사용량을 줄이고, 연산 속도를 높이는 기법입니다.
+> 희소성이란 모델 내부의 뉴런 활성화나 가중치 등에서 “대부분의 값이 0이거나 거의 사용되지 않고 소수만 의미 있게 작동하는” 특성을 의미합니다.
+
+## 전문가 혼합 (Mixture of Experts) 관련 최적화
+
+MoE 추론 최적화에 대해서는 이 [링크](./moe_related_ko.md)를 참조하세요.
+
+> Mixture of Experts란 한 모델이 모든 입력을 처리하는 대신, 여러 "전문가(Expert)" 네트워크 중 일부만 선택적으로 활성화해서 입력을 처리하는 방식입니다.
+
+## LLM을 위한 강화 학습, 시스템 최적화(Reinforcement Learning for LLMs, System Optimization)
+
+인간 피드백으로부터의 강화 학습 (RLHF) & 검증 가능한 보상을 통한 강화 학습 (RLVR)
+
+| 제목 | 링크 |
+| ------------------------------------------------------------ | ------------------------------------------- |
+| OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework | [[paper]](https://arxiv.org/abs/2405.11143) |
+| HybridFlow: A Flexible and Efficient RLHF Framework | [[paper]](https://arxiv.org/abs/2409.19256) |
+| StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation | [[paper]](http://arxiv.org/abs/2504.15930) |
+
+> RLHF(인간 피드백으로부터의 강화 학습)는 모델이 사람이 매긴 선호(preference) 피드백을 보상 신호로 삼아 학습하도록 하는 방법입니다.
+> RLVR(검증 가능한 보상을 통한 강화 학습)는 수학 정답 채점이나 테스트 통과 여부처럼 자동으로 검증할 수 있는 보상 신호를 통해 모델이 학습하도록 하는 방법입니다. 아래에 검증 가능한 보상의 간단한 스케치를 덧붙입니다.
\ No newline at end of file
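위 RLVR 설명에서 말한 “검증 가능한 보상”이 어떤 것인지 보여 주는 토이 스케치입니다. 함수 이름과 채점 규칙(출력의 마지막 숫자를 정답과 비교)은 설명을 위한 임의의 가정이며, 위에 나열된 특정 프레임워크의 API가 아닙니다.

```python
# 가정: 정답을 자동으로 채점할 수 있는 과제(예: 수학 문제)에 대한 규칙 기반 보상 함수
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """모델 출력에서 마지막 숫자를 답으로 추출해 정답과 비교합니다 (맞으면 1.0, 틀리면 0.0)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

# 사람이 매긴 선호 라벨(RLHF) 대신, 이런 자동 검증 신호가 RLVR의 학습 보상이 됩니다.
print(verifiable_reward("계산 과정을 거치면 답은 42입니다.", "42"))  # 1.0
print(verifiable_reward("답은 41입니다.", "42"))                     # 0.0
```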