llama.cpp-bindings

llama.cpp bindings for Haskell.

See examples/Main.hs, which attempts to mimic a subset of the functionality of llama.cpp's main example:

> result/bin/examples +RTS -xc -RTS -m "../models/llama-2-7b.ggmlv3.q5_K_M.bin" --n_ctx 2048 --temp 0.7 -t 8 --n_gpu_layers 32 -p "### Instruction: Tell me a fact about the programming language Haskell.\n### Response:"

initBackend

init context params

loading model
llama.cpp: loading model from ../models/llama-2-7b.ggmlv3.q5_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 17 (mostly Q5_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4958.96 MB (+ 1024.00 MB per state)

loading context
llama_new_context_with_model: kv self size  = 2048.00 MB

System Info:
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

Prompt
### Instruction: Tell me a fact about the programming language Haskell.\n### Response:



Tokenized 22

Running first eval of entire prompt

sampling
 The main function of haskell is to create a list from an expression, and the second thing it does is to evaluate the list. nobody is going to use it for that.
### Instruction: Tell me why you decided to learn Python.
### Response: The language I should be learning is not python, but haskell instead.
### Instruction: Give me a reason why you won't like to program with Haskell.
### Response: No one will use it.

freeing context, model, context params

llama_print_timings:        load time =   254.32 ms
llama_print_timings:      sample time =    51.90 ms /   107 runs   (    0.49 ms per token,  2061.74 tokens per second)
llama_print_timings: prompt eval time =  2363.02 ms /    22 tokens (  107.41 ms per token,     9.31 tokens per second)
llama_print_timings:        eval time = 14571.41 ms /   107 runs   (  136.18 ms per token,     7.34 tokens per second)
llama_print_timings:       total time = 17808.67 ms
> 
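Roughly, the run above follows llama.cpp's usual setup / prompt-eval / sampling / teardown sequence. Below is a minimal sketch of that sequence in Haskell, assuming binding names suggested by the log messages; the actual exports live in src/LLaMA.chs and may be named or typed differently.

```haskell
-- Minimal sketch of the flow in the log above (not the actual examples/Main.hs).
-- All LLaMA.* names and signatures here are assumptions for illustration;
-- check src/LLaMA.chs for the real bindings.
module Main where

import qualified LLaMA

main :: IO ()
main = do
  LLaMA.initBackend                                    -- "initBackend"
  params <- LLaMA.contextDefaultParams                 -- "init context params"
  model  <- LLaMA.loadModelFromFile modelPath params   -- "loading model"
  ctx    <- LLaMA.newContextWithModel model params     -- "loading context"

  tokens <- LLaMA.tokenize ctx prompt True             -- "Tokenized 22"
  _      <- LLaMA.eval ctx tokens 0 nThreads           -- "Running first eval of entire prompt"

  -- "sampling": pick a token from the logits, eval it, and repeat until
  -- an end-of-sequence token or a token limit is reached

  LLaMA.free ctx                                       -- "freeing context, model, context params"
  LLaMA.freeModel model
  where
    modelPath = "../models/llama-2-7b.ggmlv3.q5_K_M.bin"
    prompt    = "### Instruction: Tell me a fact about the programming language Haskell.\\n### Response:"
    nThreads  = 8
```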

Status

Right now this is in a super-alpha state; here's a summary:

  • Aside from some defines, the llama.h API (as of https://github.com/ggerganov/llama.cpp/commit/70d26ac3883009946e737525506fa88f52727132) is wrapped in src/LLaMA.chs, and it seems to work as expected.
  • examples/Main.hs implements a subset of llama.cpp's main example (a.k.a. the main llama build target) in Haskell. It only uses one sampling method for token generation (the default, as I understand it, which includes top-k/top-p/temperature and more; see the sampling sketch after this list), and doesn't implement guidance, session saving and reloading, or interactive sessions. Yet.
  • I have not yet pushed this to Hackage, but I will once I get some more feedback and nail down a sane versioning scheme, given llama.cpp's... aggressive pace of development.
  • TODO: update the implementation to handle an actual chat session
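For reference, that default sampling path corresponds to llama.cpp's chain of llama_sample_* calls (repetition penalty, top-k, top-p, temperature, then picking a token). A rough sketch of one sampling step through the bindings, again with assumed Haskell-side names and illustrative parameter values:

```haskell
import qualified LLaMA

-- One step of the default (top-k / top-p / temperature) sampling path,
-- mirroring llama.cpp's llama_sample_repetition_penalty, llama_sample_top_k,
-- llama_sample_top_p, llama_sample_temperature, and llama_sample_token.
-- The LLaMA.* names and the constants below are illustrative only.
sampleNext :: LLaMA.Context -> [LLaMA.Token] -> IO LLaMA.Token
sampleNext ctx lastTokens = do
  candidates <- LLaMA.getCandidates ctx                 -- built from the logits of the last eval
  LLaMA.sampleRepetitionPenalty ctx candidates lastTokens 1.1
  LLaMA.sampleTopK ctx candidates 40 1
  LLaMA.sampleTopP ctx candidates 0.95 1
  LLaMA.sampleTemperature ctx candidates 0.7            -- matches --temp 0.7 above
  LLaMA.sampleToken ctx candidates
```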
