
# LLMs Acceleration Paper List

Here is the list of papers, categorized by technique. Short illustrative code sketches for some of the categories follow the table.

A list categorized by conference will be added in the future.

| Methodology | Paper | Publication Venue | Materials |
| --- | --- | --- | --- |
| Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv 2023 | code |
| | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 2023 | code |
| | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | code |
| | OWQ: Lessons learned from activation outliers for weight quantization in large language models | arXiv 2023 | code |
| | LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | arXiv 2022 | |
| | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | code |
| | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | code |
| | RPTQ: Reorder-based Post-training Quantization for Large Language Models | arXiv 2023 | code |
| | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | arXiv 2023 | |
| Sparsity | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | arXiv 2023 | code |
| | A Simple and Effective Pruning Approach for Large Language Models | arXiv 2023 | code |
| | LLM-Pruner: On the Structural Pruning of Large Language Models | arXiv 2023 | code |
| | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | code |
| | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML 2023 | code |
| Attention Pattern | Efficient Streaming Language Models with Attention Sinks | arXiv 2023 | code |
| | LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models | arXiv 2023 | code |
| | Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences | arXiv 2023 | |
| Architecture-level | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB 2024 | code |
| | vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | arXiv 2023 | code |
| | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 2022 | code |
| System-level | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 2023 | code |
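
The quantization entries above all build on low-bit weight representations. As a rough illustration of the common baseline they improve upon, here is a minimal sketch of symmetric per-channel round-to-nearest INT8 weight quantization. This is a generic NumPy sketch, not code from any of the listed repositories; all names and shapes are illustrative.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization, one scale per output channel.

    w: weight matrix of shape (out_features, in_features).
    Returns the INT8 weights and the per-channel scales needed to dequantize.
    """
    # One scale per row (output channel), derived from that row's max magnitude.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0   # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quick check: reconstruction error should be small relative to the weights.
w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```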
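
For the sparsity entries, the shared starting point is removing low-importance weights. Below is a minimal sketch of unstructured magnitude pruning, assumed here as the simplest representative baseline; the listed papers (e.g. SparseGPT, LLM-Pruner) go well beyond it.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(w.size * sparsity)          # number of weights to remove
    if k == 0:
        return w.copy()
    # The k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return w * (np.abs(w) > threshold)  # mask keeps weights above the threshold

w = np.random.randn(8, 8).astype(np.float32)
pruned = magnitude_prune(w, 0.5)
print("achieved sparsity:", 1.0 - np.count_nonzero(pruned) / pruned.size)
```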
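
The attention-sinks entry (StreamingLLM) keeps the first few tokens plus a sliding window of recent tokens in the KV cache, which bounds memory on arbitrarily long streams. A small sketch of that keep policy follows; the function name and default parameter values are illustrative, not from the paper's code.

```python
def kept_kv_positions(cache_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Token positions whose KV entries survive under an attention-sink policy:
    always retain the first `n_sink` tokens plus the most recent `window` tokens."""
    sinks = range(min(n_sink, cache_len))
    recent_start = max(n_sink, cache_len - window)
    return list(sinks) + list(range(recent_start, cache_len))

# A 6000-token stream keeps only the 4 sink tokens plus the last 1020 tokens.
print(len(kept_kv_positions(6000)))  # -> 1024
```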
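
The vLLM entry introduces PagedAttention, which manages the KV cache in fixed-size blocks mapped through a per-sequence block table, much like virtual-memory pages, so sequences need no contiguous cache allocation. Below is a toy sketch of such a block table; the class and method names are mine, not vLLM's API.

```python
class PagedKVCache:
    """Toy block table in the spirit of PagedAttention: each sequence's KV cache
    lives in fixed-size physical blocks that need not be contiguous in memory."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve the cache slot for one new token of `seq_id`.
        Returns (physical block id, offset within that block)."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # last block is full, or this is the first token
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return self.block_tables[seq_id][-1], n % self.block_size

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(20):                      # 20 tokens span 2 non-contiguous blocks
    block, offset = cache.append_token("seq-0")
print(len(cache.block_tables["seq-0"]))  # -> 2
```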
