
[CI-Examples] Add Candle ML framework example #1938

Closed
wants to merge 2 commits

Conversation


@dimakuv dimakuv commented Jul 10, 2024

Description of the changes

Candle is a minimalist ML framework for Rust with a focus on performance and ease of use. This PR adds two Candle examples: simple matrix multiplication (to quickly test functionality) and Quantized LLaMA (to test performance).
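For reference, a minimal Candle matrix multiplication looks roughly like the snippet below (a sketch based on Candle's public candle-core API; not necessarily the exact code added by this PR):

use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = Device::Cpu;
    // Two random matrices: (2 x 3) and (3 x 4).
    let a = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1.0, (3, 4), &device)?;
    // Their product is a (2 x 4) tensor; printing it exercises the whole stack.
    let c = a.matmul(&b)?;
    println!("{c}");
    Ok(())
}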

How to test this PR?

Follow the README.


This change is Reviewable

This is required by, e.g., the gemm-common Rust crate; see
`gemm-common/src/cache.rs`. Without this file, the crate
incorrectly calculates the shared-CPU count as zero, which leads
to a division-by-zero exception.

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
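
To illustrate the failure mode, here is a simplified Rust sketch (hypothetical code, not the actual gemm-common implementation) of how parsing a shared_cpu_list sysfs file can produce a zero CPU count that is later used as a divisor:

use std::fs;

// Hypothetical helper mimicking gemm-common-style logic: count the CPUs
// sharing a cache by parsing a sysfs list such as "0-35" or "0,2,4-7".
fn shared_cpu_count() -> usize {
    let path = "/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list";
    let Ok(list) = fs::read_to_string(path) else {
        return 0; // file absent (as under Gramine before this fix)
    };
    list.trim()
        .split(',')
        .map(|range| match range.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.parse().unwrap_or(0);
                let hi: usize = hi.parse().unwrap_or(0);
                hi.saturating_sub(lo) + 1
            }
            None => 1,
        })
        .sum()
}

fn main() {
    // With the sysfs file missing, shared_cpu_count() returns 0 and the
    // integer division below panics ("attempt to divide by zero").
    let cache_per_cpu = 32usize * 1024 * 1024 / shared_cpu_count();
    println!("cache share per CPU: {cache_per_cpu} bytes");
}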
Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)

a discussion (no related file):
#1937 is a prerequisite. Blocking.


Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 6 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)

a discussion (no related file):
We need to decide where to put this example (and whether we want this example at all). Most probably it should go into the separate Examples repo? I don't know.


a discussion (no related file):
I have two examples and am not sure if both are needed. If we decide to keep only one, I would prefer the Quantized LLaMA one, because it is much more complex and can be used for benchmarking.



CI-Examples/candle/Makefile line 25 at r1 (raw file):

	mkdir -p $(SRCDIR) && cd $(SRCDIR) && \
		cargo new candle_matmul && cd candle_matmul && \
		cargo add --git https://github.com/huggingface/candle.git candle-core && \

I hard-coded all URLs and SHA256 hashes for now; I'm not sure it's worth making them variables.
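
If we do turn them into variables, it could look something like this (a sketch with hypothetical variable names, not the Makefile in this PR; recipe lines must be tab-indented):

# Hypothetical overridable variables; the PR currently hard-codes these.
CANDLE_REPO  ?= https://github.com/huggingface/candle.git
MODEL_URL    ?= <model download URL>
MODEL_SHA256 ?= <expected SHA256 hash>

model.bin:
	wget -O $@ "$(MODEL_URL)"
	echo "$(MODEL_SHA256)  $@" | sha256sum --check

$(SRCDIR)/candle_matmul:
	mkdir -p $(SRCDIR) && cd $(SRCDIR) && \
		cargo new candle_matmul && cd candle_matmul && \
		cargo add --git $(CANDLE_REPO) candle-core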

Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)


CI-Examples/candle/candle_quantized.manifest.template line 8 at r1 (raw file):

loader.log_level = "{{ log_level }}"

loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}"

Must add RAYON_NUM_THREADS as a passthrough envvar, so that users can change the number of threads to run.
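
Something like the following in the manifest template should do it (this is Gramine's documented syntax for passthrough environment variables):

loader.env.RAYON_NUM_THREADS = { passthrough = true }

With that, e.g. RAYON_NUM_THREADS=36 gramine-sgx ./candle_quantized picks up the host-set value, as in the benchmark commands later in this thread.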


CI-Examples/candle/candle_quantized.manifest.template line 25 at r1 (raw file):

sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}
sgx.max_threads = {{ '1' if env.get('EDMM', '0') == '1' else '256' }}
sgx.enclave_size = "16G"

Need to bump to "32G". The original workload takes up to 5.5GB, and the enclave with 16GB (minus the ASLR adjustments, minus Gramine's internal state) may error out with ENOMEM.
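
Concretely, the quoted line would become just a size bump:

sgx.enclave_size = "32G"

This leaves roughly 26GB of headroom over the ~5.5GB peak usage, which should comfortably absorb the ASLR adjustments and Gramine's internal state.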

@dimakuv dimakuv force-pushed the dimakuv/add-candle-rust-example branch from 18d0dbb to eb802a4 on July 12, 2024 09:22
Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)

a discussion (no related file):
Quick benchmark results for candle_quantized (LLaMA 2 7B). Collected on a powerful SPR machine with 2 NUMA nodes and 72 physical cores (i.e. 36 physical cores per node). The workload runs with 36 threads, pinned to the 36 physical cores of NUMA node 0. SGX PRM (essentially the EPC) is configured with 32GB on each NUMA node.

  • Original workload:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
    ./candle_quantized --model llama-2-7b.ggmlv3.q4_0.bin --tokenizer tokenizer.json --sample-len 200
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 3.46s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
   6 prompt tokens processed: 9.50 token/s
 199 tokens generated: 6.07 token/s

  • gramine-direct:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
    gramine-direct ./candle_quantized
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 3.17s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
   6 prompt tokens processed: 3.01 token/s
 199 tokens generated: 1.56 token/s

  • gramine-sgx, no EDMM:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
    gramine-sgx ./candle_quantized
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 27.87s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
   6 prompt tokens processed: 2.36 token/s
 199 tokens generated: 6.83 token/s

  • gramine-sgx, with EDMM:
~/gramine/CI-Examples/candle$ RAYON_NUM_THREADS=36 numactl --cpunodebind=0 --membind=0 \
    gramine-sgx ./candle_quantized
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (3.79GB) in 43.01s
params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
model built
...
   6 prompt tokens processed: 0.07 token/s
 199 tokens generated: 5.87 token/s

To be honest, I don't know how to interpret these.



CI-Examples/candle/candle_quantized.manifest.template line 8 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Must add RAYON_NUM_THREADS as a passthrough envvar, so that users can change the number of threads to run.

Done


CI-Examples/candle/candle_quantized.manifest.template line 25 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Need to bump to "32G". The original workload takes up to 5.5GB, and the enclave with 16GB (minus the ASLR adjustments, minus Gramine's internal state) may error out with ENOMEM.

Done

Candle is a minimalist ML framework for Rust with a focus on performance
and ease of use. This commit adds two examples with Candle: simple
matrix multiplication (to quickly test functionality) and Quantized
LLaMA (to test performance).

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
@dimakuv dimakuv force-pushed the dimakuv/add-candle-rust-example branch from eb802a4 to e52efcd on July 12, 2024 09:23
@dimakuv dimakuv marked this pull request as ready for review on July 12, 2024 10:08
Contributor

@kailun-qin kailun-qin left a comment


Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)

a discussion (no related file):
Maybe it's better to include it in our examples repo (instead of CI-Examples) so that it's tested before each release (instead of on every CI run)?


Member

@mkow mkow left a comment


Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We need to decide where to put this example (and whether we want this example at all). Most probably it should go into the separate Examples repo? I don't know.

I'm against it; I think it's not popular enough to justify the burden of maintaining it (but I'll be happy to change my mind if you prove that it's actually popular). Maybe the Examples repo would be better, assuming someone wants to maintain it there.


@monavij

monavij commented Jul 21, 2024

I'm against it; I think it's not popular enough to justify the burden of maintaining it (but I'll be happy to change my mind if you prove that it's actually popular). Maybe the Examples repo would be better, assuming someone wants to maintain it there.

One reason is that it's a good Rust example and we don't have one in Examples. One way to make something popular is to use it, especially if we like it :-)

Member

@mkow mkow left a comment


We already have an example in Rust: https://github.com/gramineproject/gramine/tree/master/CI-Examples/rust.

Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)

@dimakuv dimakuv force-pushed the dimakuv/add-cache-sysfs-shared-cpu-list branch from 6639193 to 7e44993 on July 24, 2024 06:36
Base automatically changed from dimakuv/add-cache-sysfs-shared-cpu-list to master on July 24, 2024 14:13
@dimakuv
Author

dimakuv commented Jul 26, 2024

Closing this PR. The fix to Gramine LibOS was merged in this repo, and the Candle example itself was moved to another repo: gramineproject/examples#104

@dimakuv dimakuv closed this Jul 26, 2024
@dimakuv dimakuv deleted the dimakuv/add-candle-rust-example branch July 26, 2024 06:16