
# LLMs Acceleration Paper List

Here is the list of papers, categorized by technique. Short illustrative code sketches for some of the categories follow the table.

A list categorized by conference will be added in the future.

| Methodology | Paper | Publication Venue | Materials |
| --- | --- | --- | --- |
| Quantization | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv 2023 | code |
| | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv 2023 | code |
| | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | code |
| | OWQ: Lessons learned from activation outliers for weight quantization in large language models | arXiv 2023 | code |
| | LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | arXiv 2022 | |
| | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | code |
| | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | code |
| | RPTQ: Reorder-based Post-training Quantization for Large Language Models | arXiv 2023 | code |
| | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | arXiv 2023 | |
| Sparsity | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | arXiv 2023 | code |
| | A Simple and Effective Pruning Approach for Large Language Models | arXiv 2023 | code |
| | LLM-Pruner: On the Structural Pruning of Large Language Models | arXiv 2023 | code |
| | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | code |
| | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML 2023 | code |
| Attention Pattern | Efficient Streaming Language Models with Attention Sinks | arXiv 2023 | code |
| | LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models | arXiv 2023 | code |
| | Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences | arXiv 2023 | |
| Architecture-level | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB 2024 | code |
| | vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | arXiv 2023 | code |
| | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 2022 | code |
| System-level | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 2023 | code |
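
The quantization entries above all build on low-bit weight representations. As a rough illustration of the common baseline they improve upon, here is a minimal sketch of symmetric per-channel round-to-nearest INT8 weight quantization. This is a generic NumPy sketch, not code from any of the listed repositories; all names and shapes are illustrative.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization, one scale per output channel.

    w: weight matrix of shape (out_features, in_features).
    Returns the INT8 weights and the per-channel scales needed to dequantize.
    """
    # One scale per row (output channel), derived from that row's max magnitude.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0   # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quick check: reconstruction error should be small relative to the weights.
w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```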
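
For the sparsity entries, the shared starting point is removing low-importance weights. Below is a minimal sketch of unstructured magnitude pruning, assumed here as the simplest representative baseline; the listed papers (e.g. SparseGPT, LLM-Pruner) go well beyond it.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(w.size * sparsity)          # number of weights to remove
    if k == 0:
        return w.copy()
    # The k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return w * (np.abs(w) > threshold)  # mask keeps weights above the threshold

w = np.random.randn(8, 8).astype(np.float32)
pruned = magnitude_prune(w, 0.5)
print("achieved sparsity:", 1.0 - np.count_nonzero(pruned) / pruned.size)
```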
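
The attention-sinks entry (StreamingLLM) keeps the first few tokens plus a sliding window of recent tokens in the KV cache, which bounds memory on arbitrarily long streams. A small sketch of that keep policy follows; the function name and default parameter values are illustrative, not from the paper's code.

```python
def kept_kv_positions(cache_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Token positions whose KV entries survive under an attention-sink policy:
    always retain the first `n_sink` tokens plus the most recent `window` tokens."""
    sinks = range(min(n_sink, cache_len))
    recent_start = max(n_sink, cache_len - window)
    return list(sinks) + list(range(recent_start, cache_len))

# A 6000-token stream keeps only the 4 sink tokens plus the last 1020 tokens.
print(len(kept_kv_positions(6000)))  # -> 1024
```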
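
The vLLM entry introduces PagedAttention, which manages the KV cache in fixed-size blocks mapped through a per-sequence block table, much like virtual-memory pages, so sequences need no contiguous cache allocation. Below is a toy sketch of such a block table; the class and method names are mine, not vLLM's API.

```python
class PagedKVCache:
    """Toy block table in the spirit of PagedAttention: each sequence's KV cache
    lives in fixed-size physical blocks that need not be contiguous in memory."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve the cache slot for one new token of `seq_id`.
        Returns (physical block id, offset within that block)."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # last block is full, or this is the first token
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return self.block_tables[seq_id][-1], n % self.block_size

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(20):                      # 20 tokens span 2 non-contiguous blocks
    block, offset = cache.append_token("seq-0")
print(len(cache.block_tables["seq-0"]))  # -> 2
```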
