llama 3.1 405B fp8 support #383
I have thoroughly gone through all of the examples and interfaces in optimum-habana. The current situation is that the libraries are in a state of disrepair for everything but bf16, as a result of a lack of unit testing, integration testing, and regression testing. The examples do not work because they were written against previous library versions that are no longer compatible with the current versions of the other libraries. The only quantization path that does work is compile-time quantization, which does not actually reduce the number of devices needed to run a model, though it does increase inference speed. With llama 3.1 405b it is currently impossible to run on a single node, but only because the software packages are not being maintained in a functioning state.

I have spent 3 days so far on this endeavor, and I am unwilling to take the time needed to become a maintainer of those libraries, even though I do want to reduce hallucinations in my language modeling tasks. I have also been asked by @jaanli to finish my AGPL edge-oriented mlops infrastructure package more quickly, so that he can migrate away from google tpu cloud.
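For reference, the compile-time FP8 path on Gaudi is driven by a JSON config that the serving stack picks up via the `QUANT_CONFIG` environment variable. A minimal sketch of the two-phase flow (measure, then quantize) is below; the key names follow the Habana quantization toolkit docs as I remember them and may drift between SynapseAI releases, so treat this as illustrative, not authoritative:

```python
# Hedged sketch of Gaudi FP8 setup, assuming the habana_quantization_toolkit
# config format (method/mode/observer/scale_method keys); verify against the
# docs for your SynapseAI version before relying on this.
import json
import os

# First pass: run inference once in "measure" mode to collect activation stats.
measure_cfg = {
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "dump_stats_path": "./hqt_output/measure",
}

# Second pass: rerun in "quantize" mode, reusing the recorded scales for FP8.
quant_cfg = {
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./hqt_output/measure",
}

with open("maxabs_measure.json", "w") as f:
    json.dump(measure_cfg, f)
with open("maxabs_quant.json", "w") as f:
    json.dump(quant_cfg, f)

# The serving stack (tgi-gaudi / vllm-fork) reads this at model load time.
os.environ["QUANT_CONFIG"] = os.path.abspath("maxabs_quant.json")
```

Note this only changes the numeric format at load time; the full-precision checkpoint still has to fit during conversion, which is why it does not reduce the device count for 405B.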
Thanks so much @endomorphosis, on behalf of @onefact! I'm giving a talk on Thursday; it would be great if it's possible to demo any edge models at https://duckdb.org/2024/08/15/duckcon5. Even just an encoder-only small transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)
I have no idea what hardware you are running it on.
Ah yes, sorry - an iPhone 15 Pro Max with the latest firmware.
HabanaAI/vllm-fork#144 |
I have been staging some updates testing the tgi-gaudi software with llama 405B fp8. I am waiting for optimum-habana to approve the PR; then I will submit a PR for huggingface/tgi-gaudi, and after that a PR for TGI in the microservices.
I got it running on Xeon with llama_cpp (which is what ollama is based on) at 1 tok/s on Sapphire Rapids. Next I am going to test speculative decoding with llama 3.1 8b as the draft model, which should improve performance 10-20x depending on how many drafted tokens the target model accepts. However, ollama is broken, and that will need to be investigated further.
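To make the 10-20x claim concrete, here is a minimal sketch of the speculative-decoding control flow. Nothing here is llama_cpp's actual API: `target` and `draft` are hypothetical next-token callables standing in for the 405B and 8B models. The point is the draft-then-verify loop; a real implementation verifies all K drafted tokens in a single batched forward pass of the target model, which is where the speedup comes from.

```python
# Minimal, self-contained sketch of speculative decoding (greedy acceptance).
# `target` and `draft` are hypothetical stand-ins, not a real model API.
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # token sequence -> next token


def speculative_decode(
    target: NextToken,       # expensive model (e.g. llama 3.1 405B)
    draft: NextToken,        # cheap draft model (e.g. llama 3.1 8B)
    prompt: List[Token],
    max_new_tokens: int = 32,
    k: int = 4,              # tokens drafted per verification step
) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft K tokens autoregressively with the cheap model.
        proposed: List[Token] = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))
        # 2. Verify: keep the longest prefix the target model agrees with.
        #    (A real implementation scores all K positions in one batched
        #    target pass; this loop only mimics the acceptance rule.)
        for i in range(k):
            expected = target(tokens + proposed[:i])
            if expected != proposed[i]:
                # Target disagrees: take its token instead and stop accepting.
                tokens.append(expected)
                break
            tokens.append(proposed[i])
        # Each verification pass yields 1..K+ tokens for ~one target-model
        # step, so throughput scales with the draft model's acceptance rate.
    return tokens[: len(prompt) + max_new_tokens]


if __name__ == "__main__":
    # Toy usage: both "models" emit a fixed pattern, so every draft is accepted.
    pattern = [1, 2, 3, 4]
    fake: NextToken = lambda ts: pattern[len(ts) % len(pattern)]
    print(speculative_decode(fake, fake, prompt=[0], max_new_tokens=8))
```

The output distribution matches the target model alone (under greedy decoding here); only the number of expensive forward passes changes, which is why the realized speedup depends entirely on how often the 8B draft agrees with the 405B target.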