Implement kv cache #74
Merged: 9 commits merged into mlfoundations:main from kv_cache on Dec 13, 2023

Conversation

@jmercat (Collaborator) commented Nov 8, 2023

Explanation:

KV cache is an inference trick to avoid re-computing the whole sequence when only one token has been added. It stores the keys and values of each attention layer in a list for later use.
If use_cache is set to True, the HuggingFace generator handles this automatically and only feeds the last token, along with the KV cache, when generating a sequence.
The past keys and values are extracted from the cache and concatenated with the next token's keys and values in each attention layer. A single query then attends to this sequence to generate the next token.
Positional embeddings need to be offset because they can no longer rely on the position of the query in the sequence (there is only one query).
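To make the mechanism concrete, here is a minimal sketch of cached attention for a single layer. This is illustrative only, not the open_lm implementation; the tensor layout (batch, seq, heads, head_dim) and all names are assumptions:

```python
import torch

def attention_with_kv_cache(q, k, v, past_key_values=None, use_cache=False):
    # q, k, v: (batch, seq_len, n_heads, head_dim). During cached decoding,
    # seq_len is 1: only the newly added token is passed in.
    if past_key_values is not None:
        past_k, past_v = past_key_values
        # Append the new token's key/value to everything cached so far.
        k = torch.cat([past_k, k], dim=1)
        v = torch.cat([past_v, v], dim=1)

    # Note: rotary/positional embeddings must be applied to the new q and k with
    # an offset equal to the cached length, so they are encoded as if they sat
    # at the end of the full sequence (not shown here).

    # The new query attends over the full (cached + new) key/value sequence.
    # No causal mask is needed for a single new query attending to all
    # earlier positions.
    scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / (q.shape[-1] ** 0.5)
    out = torch.einsum("bhqk,bkhd->bqhd", scores.softmax(dim=-1), v)

    # Hand the updated cache back so the next decoding step can reuse it.
    new_cache = (k, v) if use_cache else None
    return out, new_cache
```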

Tests:

LLaMA2 and a custom-trained model have been tested qualitatively with the generation.py script with this change (with and without the --use-cache flag). Jeopardy with LLaMA2 was also run to ensure there is no regression.

Changes:

  • The model now outputs past_key_values regardless of use_cache (it returns None when the cache is not used), so it returns 3 output values instead of 2 (a rough sketch of the resulting call pattern follows this list)
  • Optional inputs past_key_values and use_cache are accepted by the model and passed all the way down to the attention computation
  • The positional embedding takes an extra offset argument that shifts the queries and keys in the time sequence. This is used to place the new query and key at the end of the kv cache sequence
  • head rotary is not compatible with offsets but will run anyway, since it doesn't need to offset anything
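
A rough sketch of the resulting call pattern; argument and return names are taken from the description above and are illustrative, not the exact open_lm signatures:

```python
def model_forward(layers, norm, output, x, past_key_values=None, use_cache=False):
    # Sketch of the updated forward pass (illustrative names).
    # layers: per-layer callables returning (hidden, (key, value) or None)
    # norm, output: final norm and output projection modules
    if past_key_values is None:
        past_key_values = [None] * len(layers)

    # Offset positional embeddings by the cached length so new tokens are
    # encoded as if they sat right after the cached sequence.
    offset = 0 if past_key_values[0] is None else past_key_values[0][0].shape[1]

    new_cache = []
    for layer, past in zip(layers, past_key_values):
        x, kv = layer(x, past_key_values=past, use_cache=use_cache, offset=offset)
        new_cache.append(kv)

    logits = output(norm(x))
    # The model now returns three values; the cache slot is None when unused.
    return logits, x, new_cache if use_cache else None
```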

@jmercat jmercat force-pushed the kv_cache branch 5 times, most recently from 61f3b16 to 75fb8cd Compare November 10, 2023 00:53
Review threads on open_lm/model.py, open_lm/positional_embedding/rotary.py, and open_lm/train.py (resolved).
@achalddave (Collaborator) commented:

Other than the one comment about adding comments to _update_cos_sin_tables, this looks good to me! Would it be possible to document the speedup we get with KV-cache, compared to say HF LLaMa2? Doesn't need to be super thorough, just a quick gut check that we have roughly the right speedup.

@achalddave achalddave self-assigned this Nov 10, 2023
@achalddave (Collaborator) commented:

oh and finally, could you paste the outputs of pytest so we can merge? (I believe dataloading tests currently fail due to a missing assets/ folder that we're working on adding, but the rest should pass)

@jmercat (Collaborator, Author) commented Nov 13, 2023

> oh and finally, could you paste the outputs of pytest so we can merge? (I believe dataloading tests currently fail due to a missing assets/ folder that we're working on adding, but the rest should pass)

[screenshot: pytest output]

There were bugs, but most didn't come from my changes:

  • I added ignore_parse_errors to MockDataArgs.
  • In train.py l.144 (sample chunk) I updated the call to the new args input.
  • In the accumulation test I updated SimpleModel to output 3 items (this one was due to my changes).

I added a test for the kv cache, but it is slow, so I also added a slow marker in pyproject.toml; slow tests can be marked with @pytest.mark.slow.
Tests marked as slow can be skipped with pytest . -m "not slow", which might help with #59 (comment).
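
For reference, marking and deselecting a slow test looks roughly like this (the test name and body are placeholders, not the actual test added in this PR):

```python
import pytest

# Tests carrying the `slow` marker (registered in pyproject.toml) are collected
# normally but can be deselected with: pytest . -m "not slow"
@pytest.mark.slow
def test_kv_cache_generation():
    # Placeholder body; the real test generates with and without the cache
    # and compares the results.
    pass
```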

@jmercat (Collaborator, Author) commented Nov 13, 2023

[image attached]

@jmercat (Collaborator, Author) commented Nov 13, 2023

With the attached script and a 1B model I get the following results:


Context length | Generation length | Time with cache | Time without cache | Time gain (%)
--------------------------------------------------------------------------------------------
512            | 512               | 18.1522          | 30.4054             | 40.30          
512            | 1024              | 20.8708          | 80.5402             | 74.09          
512            | 1536              | 29.9303          | 152.6310            | 80.39          
1024           | 512               | 15.6641          | 50.1081             | 68.74          
1024           | 1024              | 25.5443          | 123.8644            | 79.38          
1536           | 512               | 19.4633          | 73.8159             | 73.63          
--------------------------------------------------------------------------------------------

time_generate.txt

(the test was done on a single NVIDIA A6000 GPU)

@jmercat (Collaborator, Author) commented Nov 15, 2023

I added a test that checks the generated sequences. It is not very stable:
if we generate long sequences (sometimes as few as 64 tokens), the cached and non-cached generations start to diverge, while two non-cached generations do not. I considered this not a bug but the result of compounding numerical errors, so I set the test to compare only the first 32 generated tokens. This ends up passing.
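
A minimal sketch of that comparison, assuming the (logits, hidden, past_key_values) model interface described in this PR; the helper and names are illustrative, not the actual test code:

```python
import torch

def greedy_generate(model, prompt_ids, n_new, use_cache):
    # Greedy decoding; assumes the model returns (logits, hidden, past_key_values).
    tokens = prompt_ids
    past = None
    step_input = prompt_ids
    for _ in range(n_new):
        logits, _, past = model(step_input, past_key_values=past, use_cache=use_cache)
        next_token = logits[:, -1:, :].argmax(dim=-1)  # (batch, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
        # With the cache, only the newly generated token is fed back in; without
        # it, the full sequence is re-processed every step.
        step_input = next_token if use_cache else tokens
    return tokens

def check_cache_consistency(model, prompt_ids, n_generate=64, n_compare=32):
    with torch.no_grad():
        cached = greedy_generate(model, prompt_ids, n_generate, use_cache=True)
        uncached = greedy_generate(model, prompt_ids, n_generate, use_cache=False)
    prompt_len = prompt_ids.shape[1]
    # Compare only the first n_compare generated tokens: beyond that, small
    # numerical differences between the two paths can compound and flip an argmax.
    assert torch.equal(
        cached[:, prompt_len : prompt_len + n_compare],
        uncached[:, prompt_len : prompt_len + n_compare],
    )
```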

@jmercat jmercat force-pushed the kv_cache branch 2 times, most recently from fb5bdfa to 4c4f73c Compare November 15, 2023 05:23
@jmercat jmercat changed the title Implement kv cache similar to HF Implement kv cache Nov 28, 2023
@jmercat (Collaborator, Author) commented Nov 28, 2023

I've added support for kv_cache plus an input sequence (previously it only worked with a single new token as input).
However, it assumes the input sequence comes right after the kv_cache in the sequence and does not support arbitrary input position indices.

I've made 2 separate tests: one for speed on a randomly initialized model, and one for consistency that loads the 1B pre-trained OpenLM weights and checks that the generated results are the same with and without kv_cache.

I didn't produce any test for beam search...

Two tests were not passing, but I don't think the failures come from my changes:
[screenshot: failing tests]

@jmercat jmercat force-pushed the kv_cache branch 3 times, most recently from ce7c6bc to c5a72bc Compare November 29, 2023 00:40
@jmercat (Collaborator, Author) commented Nov 29, 2023

OK, so now checks are passing, but they do not include the two tests that I added (they are marked as both slow and gpu). One of them involves downloading the 1B model and using it to generate short sequences.

@jmercat jmercat force-pushed the kv_cache branch 6 times, most recently from dc3d9bd to 4cec766 Compare December 9, 2023 01:00
@jmercat (Collaborator, Author) commented Dec 9, 2023

I had to force a minimum version of multiprocess, which depends on a newer version of dill than what apache-beam wants. I could not find a common ground, so I removed apache-beam from the requirements and had to add ipython.

@achalddave achalddave changed the base branch from main to package-imports December 9, 2023 01:16
@achalddave achalddave changed the base branch from package-imports to main December 9, 2023 01:16
@achalddave (Collaborator) left a comment:

This looks good to me; as discussed, just revert the formatting changes from tests/. We can add those formatting changes in a separate PR.

@jmercat jmercat force-pushed the kv_cache branch 4 times, most recently from 99994e4 to ec80c71 Compare December 9, 2023 01:47
@ruixin31 (Contributor) commented:

Looks good to me except for the two things we discussed offline. I was also wondering if we could look at the coverage of tests but that could be done in a separate PR.

@achalddave (Collaborator) commented:

nice! For future reference and documentation, could you comment what the two things you discussed offline are, @ruixin31?

@jmercat (Collaborator, Author) commented Dec 13, 2023

> nice! For future reference and documentation, could you comment what the two things you discussed offline are, @ruixin31?

What Rui suggested is to use xformers' new masking, xops.fmha.attn_bias.LowerTriangularFromBottomRightMask().
It works nicely and should be easier to read and faster when using beam search. Sadly, it is not compatible with llm_foundry because their dependencies don't match (pip install -r requirements.txt would fail, though the packages can be installed successfully in two steps...).

I added comments about this in the code but reverted to using custom masks to keep installation simple (we don't use beam search right now, so it doesn't justify the extra installation step).
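
For reference, this is roughly how that mask would be used with xformers' memory-efficient attention (a sketch, not the code that was merged):

```python
import xformers.ops as xops

def cached_attention_xformers(q, k, v):
    # q holds only the new queries; k and v already include the cached keys and
    # values concatenated in front of the new ones (batch, seq, heads, head_dim).
    # The bottom-right-aligned causal mask lets a query block that is shorter
    # than the key sequence attend causally to the full sequence, so no manual
    # offset of the attention mask is needed.
    bias = xops.fmha.attn_bias.LowerTriangularFromBottomRightMask()
    return xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```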

@achalddave achalddave merged commit e016855 into mlfoundations:main Dec 13, 2023
2 checks passed
@jmercat jmercat deleted the kv_cache branch December 14, 2023 19:54