llama 3.1 405B fp8 support #383
I have thoroughly gone through all of the examples and interfaces in optimum-habana. The current situation is that the libraries are in a state of disrepair for everything but bf16, as a result of a lack of unit testing, integration testing, and regression testing. The examples do not work because they were written against previous library versions that are no longer compatible with the current versions of the other libraries. The only quantization path that does work is compile-time quantization, which does not actually reduce the number of devices needed to run a model, though it does increase inference speed. With llama 3.1 405b it is currently impossible to run on a single node, but only because the software packages are not being maintained in a functioning state.

I have spent 3 days so far on this endeavor, and I am unwilling to take the time needed to become a maintainer of those libraries, even though I do want to reduce hallucinations in my language modeling tasks. I have also been asked by @jaanli to finish my AGPL edge-oriented mlops infrastructure package more quickly, so that he can migrate away from google tpu cloud.
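For reference, the compile-time FP8 path on Gaudi is driven by a JSON config that the serving stack picks up via the `QUANT_CONFIG` environment variable. A minimal sketch of the two-phase flow (measure, then quantize) is below; the key names follow the Habana quantization toolkit docs as I remember them and may drift between SynapseAI releases, so treat this as illustrative, not authoritative:

```python
# Hedged sketch of Gaudi FP8 setup, assuming the habana_quantization_toolkit
# config format (method/mode/observer/scale_method keys); verify against the
# docs for your SynapseAI version before relying on this.
import json
import os

# First pass: run inference once in "measure" mode to collect activation stats.
measure_cfg = {
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "dump_stats_path": "./hqt_output/measure",
}

# Second pass: rerun in "quantize" mode, reusing the recorded scales for FP8.
quant_cfg = {
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./hqt_output/measure",
}

with open("maxabs_measure.json", "w") as f:
    json.dump(measure_cfg, f)
with open("maxabs_quant.json", "w") as f:
    json.dump(quant_cfg, f)

# The serving stack (tgi-gaudi / vllm-fork) reads this at model load time.
os.environ["QUANT_CONFIG"] = os.path.abspath("maxabs_quant.json")
```

Note this only changes the numeric format at load time; the full-precision checkpoint still has to fit during conversion, which is why it does not reduce the device count for 405B.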
Thanks so much @endomorphosis, on behalf of @onefact! I'm giving a talk on Thursday; it would be great if it's possible to demo any edge models at https://duckdb.org/2024/08/15/duckcon5. Even just an encoder-only small transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)
I have no idea what hardware you are running it on.
Ah yes, sorry - an iPhone 15 Pro Max with the latest firmware.
HabanaAI/vllm-fork#144 |
I have been staging some updates testing the tgi-gaudi software with llama 405B fp8. I am waiting for optimum-habana to approve the PR; then I will submit a PR for huggingface/tgi-gaudi, and after that a PR for TGI in the microservices.
I got it running on Xeon with llama_cpp (which is what ollama is based on) at 1 tok/s on Sapphire Rapids. Next I am going to test speculative decoding with llama 3.1 8b as the draft model, which should improve performance 10-20x depending on how many drafted tokens the target model accepts. However, ollama is broken, and that will need to be investigated further.
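To make the 10-20x claim concrete, here is a minimal sketch of the speculative-decoding control flow. Nothing here is llama_cpp's actual API: `target` and `draft` are hypothetical next-token callables standing in for the 405B and 8B models. The point is the draft-then-verify loop; a real implementation verifies all K drafted tokens in a single batched forward pass of the target model, which is where the speedup comes from.

```python
# Minimal, self-contained sketch of speculative decoding (greedy acceptance).
# `target` and `draft` are hypothetical stand-ins, not a real model API.
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # token sequence -> next token


def speculative_decode(
    target: NextToken,       # expensive model (e.g. llama 3.1 405B)
    draft: NextToken,        # cheap draft model (e.g. llama 3.1 8B)
    prompt: List[Token],
    max_new_tokens: int = 32,
    k: int = 4,              # tokens drafted per verification step
) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft K tokens autoregressively with the cheap model.
        proposed: List[Token] = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))
        # 2. Verify: keep the longest prefix the target model agrees with.
        #    (A real implementation scores all K positions in one batched
        #    target pass; this loop only mimics the acceptance rule.)
        for i in range(k):
            expected = target(tokens + proposed[:i])
            if expected != proposed[i]:
                # Target disagrees: take its token instead and stop accepting.
                tokens.append(expected)
                break
            tokens.append(proposed[i])
        # Each verification pass yields 1..K+ tokens for ~one target-model
        # step, so throughput scales with the draft model's acceptance rate.
    return tokens[: len(prompt) + max_new_tokens]


if __name__ == "__main__":
    # Toy usage: both "models" emit a fixed pattern, so every draft is accepted.
    pattern = [1, 2, 3, 4]
    fake: NextToken = lambda ts: pattern[len(ts) % len(pattern)]
    print(speculative_decode(fake, fake, prompt=[0], max_new_tokens=8))
```

The output distribution matches the target model alone (under greedy decoding here); only the number of expensive forward passes changes, which is why the realized speedup depends entirely on how often the 8B draft agrees with the 405B target.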