Experiments and benchmarks for the LROS project
On the Orange Pi Ultra that we have, cores 0-3 are Cortex-A55 (in-order, more power-efficient) while cores 4-7 are Cortex-A76 (out-of-order, more performant).
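For reproducible benchmarks it can help to pin a workload to one cluster. A minimal sketch using `taskset`; the benchmark binary name is a placeholder:

```sh
# Pin to the efficient in-order A55 cluster (cores 0-3)
taskset -c 0-3 ./run_benchmark

# Pin to the performant out-of-order A76 cluster (cores 4-7)
taskset -c 4-7 ./run_benchmark
```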
Clone with `git clone [url] --recursive` or execute `git submodule update --init --recursive` to initialize the submodules.
To get the models, run the `scripts/download_models.sh` script.
- vaccel (Need: meson >= 1.1)
```sh
cd vaccel
cd scripts/common; git apply ../../submodules.patch; cd ../..
meson setup --buildtype=release build
meson compile -C build
meson install -C build --destdir=out
sed -i "s/prefix=/prefix=\/home\/$(whoami)\/lros-expe\/vaccel\/build\/out/" /home/$(whoami)/lros-expe/vaccel/build/out/usr/local/lib/pkgconfig/vaccel.pc
```
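To verify that the relocated pkg-config file resolves (using the same out directory as the sed command above):

```sh
# Should print vaccel's include and linker flags without errors
PKG_CONFIG_PATH=/home/$(whoami)/lros-expe/vaccel/build/out/usr/local/lib/pkgconfig \
    pkg-config --cflags --libs vaccel
```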
- lros-qemu (Need: python3-tomli, libglib2.0-dev)
```sh
cd lros-qemu
mkdir build
cd build
CFLAGS=-Wno-error PKG_CONFIG_PATH=/home/$(whoami)/lros-expe/vaccel/build/out/usr/local/lib/aarch64-linux-gnu/pkgconfig ../configure --target-list=aarch64-softmmu --enable-virtfs
make -j
```
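A quick smoke test of the freshly built binary (run from the build directory):

```sh
# Should print the QEMU version, confirming the aarch64-softmmu target built
./qemu-system-aarch64 --version
```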
- llama.cpp
```sh
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```
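The binaries end up in `build/bin`. A quick check that inference works; the model path is an example:

```sh
# Generate a few tokens to verify the build; point -m at any GGUF model you have
./build/bin/llama-cli -m models/model.gguf -p "Hello" -n 16
```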
- Find suitable adapters for the model you want to run inference on: for example, for the Llama-3.1-1b-Instruct model you can use Llama-TOS and MentalChat-16K.
- Clone the repo containing the `adapter_config.json` and `adapter_model.safetensors` files.
- Convert the LoRA into GGUF format using the `convert_lora_to_gguf.py` script from the llama.cpp repo:
  - Install the requirements: `pip install -r requirements/requirements-convert_lora_to_gguf.txt`
  - Run the conversion: `./convert_lora_to_gguf.py --outfile <lora-name>.gguf --outtype f16 <cloned lora repo>`
- Start llama-server with the LoRAs: add `--lora-scaled path/to/lora.gguf 0` for every LoRA you want to supply.
  - Note: It should be possible to just pass `--lora path/to/lora.gguf` and additionally add `--lora-init-without-apply`, but that did not work in my tests.
- Modify the applied LoRA(s) in one of two ways (see the sketch after this list):
  - A `POST` request to `/lora-adapters`, supplying `[{"id": 0, "scale": 0.2},{"id": 1, "scale": 0.8}]` as the request body (LoRAs not included are automatically scaled to 0).
  - A per-request `lora` parameter in the JSON request body, containing an array like the one above.
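A minimal end-to-end sketch of the above, assuming llama-server listens on its default port 8080; the model and LoRA paths are placeholders:

```sh
# Start the server with two LoRAs registered at scale 0
./build/bin/llama-server -m models/base-model.gguf \
    --lora-scaled loras/lora-a.gguf 0 \
    --lora-scaled loras/lora-b.gguf 0 &

# Re-scale the loaded adapters globally via POST /lora-adapters
curl -X POST http://localhost:8080/lora-adapters \
    -H "Content-Type: application/json" \
    -d '[{"id": 0, "scale": 0.2},{"id": 1, "scale": 0.8}]'

# Or set scales per request via a "lora" array in the completion body
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello", "n_predict": 16, "lora": [{"id": 0, "scale": 1.0}]}'
```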