[CI-Examples] Add Candle ML framework example #1938
Conversation
This is e.g. required by the gemm-common Rust crate, see `gemm-common/src/cache.rs`. Without this file, the crate logic incorrectly calculates shared-cpu count as zero and leads to a division-by-zero exception. Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
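For illustration, the failing pattern looks roughly like this — a minimal Rust sketch in the spirit of gemm-common's cache detection, not its actual code (the function name is made up, and the specific sysfs path is only a representative example of the kind of file the commit adds):

use std::fs;

// Parse a sysfs shared_cpu_list string such as "0-3,8-11" into a CPU count.
fn shared_cpu_count(list: &str) -> usize {
    list.trim()
        .split(',')
        .filter(|s| !s.is_empty())
        .map(|range| match range.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.parse().unwrap_or(0);
                let hi: usize = hi.parse().unwrap_or(0);
                hi.saturating_sub(lo) + 1
            }
            None => 1,
        })
        .sum()
}

fn main() {
    // Without the emulated sysfs file, the read fails, the string is empty,
    // and the computed shared-CPU count is zero...
    let list = fs::read_to_string("/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list")
        .unwrap_or_default();
    let cpus = shared_cpu_count(&list);
    let cache_per_cpu = 1_048_576 / cpus; // ...so this integer division panics.
    println!("cache per CPU: {cache_per_cpu} bytes");
}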
a discussion (no related file):
#1937 is a prerequisite. Blocking.
a discussion (no related file):
We need to decide where to put this example (and whether we want it at all). Most probably it should go to the separate Examples repo, but I'm not sure.
a discussion (no related file):
I have two examples and am not sure both are needed. If we decide to keep only one, I would prefer the Quantized LLaMA one, because it is much more complex and can be used for benchmarking.
CI-Examples/candle/Makefile
line 25 at r1 (raw file):
mkdir -p $(SRCDIR) && cd $(SRCDIR) && \
	cargo new candle_matmul && cd candle_matmul && \
	cargo add --git https://github.com/huggingface/candle.git candle-core && \
I hard-coded all URLs and SHA256 hashes for now; I'm not sure it's worth making them variables.
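If we did make them variables, it could look something like this (a sketch only; the variable names, target name, and placeholder values are hypothetical, not taken from this Makefile):

# Hypothetical: overridable URL/hash variables instead of hard-coded values
CANDLE_GIT   ?= https://github.com/huggingface/candle.git
MODEL_URL    ?= <model download URL>  # placeholder
MODEL_SHA256 ?= <expected sha256>     # placeholder

llama-model.bin:
	wget -O $@ "$(MODEL_URL)"
	echo "$(MODEL_SHA256)  $@" | sha256sum --check -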
CI-Examples/candle/candle_quantized.manifest.template
line 8 at r1 (raw file):
loader.log_level = "{{ log_level }}"
loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}"
Must add RAYON_NUM_THREADS as a passthrough envvar, so that users can change the number of threads to run.
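In manifest syntax this is the standard Gramine passthrough declaration:

loader.env.RAYON_NUM_THREADS = { passthrough = true }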
CI-Examples/candle/candle_quantized.manifest.template
line 25 at r1 (raw file):
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.max_threads = {{ '1' if env.get('EDMM', '0') == '1' else '256' }}
sgx.enclave_size = "16G"
Need to bump to "32G". The original workload takes up to 5.5GB, and the enclave with 16GB (minus the ASLR adjustments, minus Gramine's internal state) may error out with ENOMEM.
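I.e., the template line would become:

sgx.enclave_size = "32G"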
Force-pushed from 18d0dbb to eb802a4.
a discussion (no related file):
Quick benchmark results for candle_quantized (LLaMA2 7B), collected on a powerful SPR machine with 2 NUMA nodes and 72 physical cores (i.e. 36 physical cores per node). The workload runs with 36 threads pinned to the 36 physical cores of NUMA node 0. SGX PRM (basically EPC) is configured with 32GB on each NUMA node.
- Original workload:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
./candle_quantized --model llama-2-7b.ggmlv3.q4_0.bin --tokenizer tokenizer.json --sample-len 200
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 3.46s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
6 prompt tokens processed: 9.50 token/s
199 tokens generated: 6.07 token/s
- gramine-direct:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
gramine-direct ./candle_quantized
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 3.17s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
6 prompt tokens processed: 3.01 token/s
199 tokens generated: 1.56 token/s
- gramine-sgx, no EDMM:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
gramine-sgx ./candle_quantized
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 27.87s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
6 prompt tokens processed: 2.36 token/s
199 tokens generated: 6.83 token/s
- gramine-sgx, with EDMM:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
gramine-sgx ./candle_quantized
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 43.01s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
6 prompt tokens processed: 0.07 token/s
199 tokens generated: 5.87 token/s
To be honest, I don't know how to interpret these.
CI-Examples/candle/candle_quantized.manifest.template
line 8 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Must add RAYON_NUM_THREADS as a passthrough envvar, so that users can change the number of threads to run.
Done
CI-Examples/candle/candle_quantized.manifest.template
line 25 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Need to bump to "32G". The original workload takes up to 5.5GB, and the enclave with 16GB (minus the ASLR adjustments, minus Gramine's internal state) may error out with ENOMEM.
Done
Candle is a minimalist ML framework for Rust with a focus on performance and ease of use. This commit adds two examples with Candle: simple matrix multiplication (to quickly test functionality) and Quantized LLaMA (to test performance). Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
Force-pushed from eb802a4 to e52efcd.
a discussion (no related file):
Maybe it would be better to include it in our examples repo (instead of CI-Examples), so that it is tested before each release (instead of on every CI run)?
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
We need to decide where to put this example (and if we want this example at all). Most probably it should go to the separate Examples repo? Don't know.
I'm against it, I think it's not popular enough to justify the burden of maintaining it (but I'll be happy to change my mind if you prove that it's actually popular). Maybe Examples repo would be better, assuming someone wants to maintain it there.
One reason is that it's a good Rust example and we don't have one in Examples. One way to make something popular is to use it, especially if we like it :-)
We already have an example in Rust: https://github.com/gramineproject/gramine/tree/master/CI-Examples/rust.
Force-pushed from 6639193 to 7e44993.
Closing this PR. The fix to the Gramine LibOS was merged in this repo, and the Candle example itself was moved to another repo: gramineproject/examples#104
Description of the changes
Candle is a minimalist ML framework for Rust with a focus on performance and ease of use. This PR adds two examples with Candle: simple matrix multiplication (to quickly test functionality) and Quantized LLaMA (to test performance).
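For context, the matmul example is conceptually as small as this (a sketch using the candle-core API; not necessarily the exact code in this PR):

use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;
    // Multiply two random 1024x1024 matrices on the CPU backend.
    let a = Tensor::randn(0f32, 1.0, (1024, 1024), &device)?;
    let b = Tensor::randn(0f32, 1.0, (1024, 1024), &device)?;
    let c = a.matmul(&b)?;
    println!("result shape: {:?}", c.shape());
    Ok(())
}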
How to test this PR?
Follow the README.