LSH KV Cache Eviction (ICLR 2025)

Introduction

Official code repository for the paper: LSH Tells You What to Discard: An Adaptive Locality-Sensitive Strategy For KV Cache Compression

Our implementation is built on top of cold-compress, and we are working on merging it into the main branch of cold-compress.

Installation

pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/

After logging in with huggingface-cli login, run:

bash scripts/prepare_llama3.sh

This will download the model and tokenizer files for Meta-Llama-3-8B-Instruct from Hugging Face and save them in a usable format inside ./checkpoints.
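
If the download and conversion succeeded, the checkpoint referenced by the commands below should now exist (path taken from the Quick Start example):

# Sanity check: the converted checkpoint used in the commands below.
ls -lh ./checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/model.pth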

Quick Start

Generate Response

python generate.py --cache_strategy lsh --lsh_dim 8 --prompt "What does the llama say?" --checkpoint_path ./checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/model.pth

This will generate a response from a compiled Llama-3 model with 8-bit LSH eviction (--cache_strategy lsh --lsh_dim 8).
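
To get a feel for the effect of the hash size, you can sweep --lsh_dim in a small shell loop. The sketch below reuses the flags from the command above; the particular values are only illustrative:

# Illustrative sweep over LSH hash sizes (the values are arbitrary examples).
for dim in 4 8 16; do
  python generate.py --cache_strategy lsh --lsh_dim "$dim" \
    --prompt "What does the llama say?" \
    --checkpoint_path ./checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/model.pth
done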

Run a single experiment using a cache config

python eval.py --cache_config lsh --lsh_dim 8 --tasks gsm8k

For a list of tasks, please refer to tasks.py. For a list of cache configs, please refer to COLD_COMPRESS_README.md and cache.py.

eval.py creates a directory under results based on the supplied cache arguments, where it dumps the raw predictions along with metrics for memory usage and task-specific performance.
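
To glance at what a run produced, you can list the dumped files afterwards. The exact file names and directory layout are determined by eval.py, so treat this as a sketch rather than the definitive output format:

# Inspect whatever eval.py dumped for your runs (names and layout may differ).
find results -type f | sort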

Experiments in Parallel

Here is how to run the L2, LSH, and Full cache config comparison experiments in parallel using multiple GPUs on a single machine.
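
parallelize_evals.py reads a command file and distributes the commands across the available GPUs. The shipped files under experiments/ define the actual runs; the lines below are only a hypothetical illustration of the one-command-per-line format and of what an L2 / LSH / Full comparison could look like (the config names are illustrative, not the exact contents of the shipped files):

# Hypothetical command file: one standalone eval.py invocation per line.
python eval.py --cache_config l2 --tasks gsm8k
python eval.py --cache_config lsh --lsh_dim 8 --tasks gsm8k
python eval.py --cache_config full --tasks gsm8k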

Free Response Question Answering

Before you run question answering experiments, you need to set an OpenAI API key, because the GPT4-Judge metric requires OpenAI API access:

export OPENAI_API_KEY=[api key]

To turn off the GPT4-Judge metric, edit the corresponding task class in tasks.py and remove GPT4-Judge from its list of metrics.

To run the gsm8k free response question answering experiments:

python parallelize_evals.py --command_file experiments/gsm8k.txt --num_gpus 8

To run the medqa free response question answering experiments:

python parallelize_evals.py --command_file experiments/medqa.txt --num_gpus 8

Replace --num_gpus 8 with the number of GPUs available on your machine.
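
If you don't know the GPU count offhand, you can query it and pass it through; the query below assumes an NVIDIA setup with nvidia-smi available:

# Detect the number of visible NVIDIA GPUs and use it for --num_gpus.
NUM_GPUS=$(nvidia-smi -L | wc -l)
python parallelize_evals.py --command_file experiments/medqa.txt --num_gpus "$NUM_GPUS"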

Multiple Choice

To run the gsm8k_mc multiple choice experiments:

python parallelize_evals.py --command_file experiments/gsm8k_mc.txt --num_gpus 8

To run the medqa_mc multiple choice experiments:

python parallelize_evals.py --command_file experiments/medqa_mc.txt --num_gpus 8

Long Context

To run the needle in a haystack long context experiments:

python parallelize_evals.py --command_file experiments/rulerniah.txt --num_gpus 8

To run the common words long context experiments:

python parallelize_evals.py --command_file experiments/cwe.txt --num_gpus 8

Visualizations

To generate the visualizations used in the paper, please refer to the instructions in VISUALIZATIONS.ipynb.
