-
Hello, I am trying to configure Ollama as the model backend for Tabby, and I encountered an issue after modifying the config.toml file as shown below:

Here is my docker-compose.yml file:

When I send a request to Ollama, the response is:

```json
{"modelfile":"# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this, replace FROM with:\n# FROM starcoder:1b\n\nFROM /root/.ollama/models/blobs/sha256-397f02a8d32c293bcb63e2578a03a3d8430d8ec744f5b3180cc677e702fcd2cf\nTEMPLATE {{ .Prompt }}\nPARAMETER stop \u003c|endoftext|\u003e\n","parameters":"stop \"\u003c|endoftext|\u003e\"","template":"{{ .Prompt }}","details":{"parent_model":"","format":"gguf","family":"starcoder","families":null,"parameter_size":"1B","quantization_level":"Q4_0"},"model_info":{"general.architecture":"starcoder","general.file_type":2,"general.parameter_count":1237870592,"general.quantization_version":2,"starcoder.attention.head_count":16,"starcoder.attention.head_count_kv":1,"starcoder.attention.layer_norm_epsilon":0.00001,"starcoder.block_count":24,"starcoder.context_length":8192,"starcoder.embedding_length":2048,"starcoder.feed_forward_length":8192,"tokenizer.ggml.bos_token_id":0,"tokenizer.ggml.eos_token_id":0,"tokenizer.ggml.merges":null,"tokenizer.ggml.model":"gpt2","tokenizer.ggml.scores":null,"tokenizer.ggml.token_type":null,"tokenizer.ggml.tokens":null,"tokenizer.ggml.unknown_token_id":0},"modified_at":"2024-07-29T13:04:37.756033766Z"}
```

Similarly:

```json
{"models":[{"name":"starcoder:1b","model":"starcoder:1b","modified_at":"2024-07-29T13:04:37.756033766Z","size":726080827,"digest":"77e6c46054d95d9c92f96c93df31948ed64116416a0d1ce2b882ca1641d71625","details":{"parent_model":"","format":"gguf","family":"starcoder","families":null,"parameter_size":"1B","quantization_level":"Q4_0"}}]}
```

**Error**

The error I am getting is:

**Details**

When I check the logs of Ollama (which is running as a Docker container), there are no log entries from Tabby. Requests from the browser are visible in the logs, but when I run Tabby, no log entries appear.

**Investigation**

The issue seems to be in the

**Request**

Could anyone help me diagnose and fix this issue? Your assistance would be greatly appreciated! Thank you!
-
Hi - have you checked whether you're able to access `http://localhost:11434` from within the container? You might need to use the `network: host` option to achieve that.
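A minimal sketch of that suggestion in docker-compose terms; the service name, image, and mounts here are assumptions, since the original docker-compose.yml was not included above:

```yaml
services:
  tabby:
    image: tabbyml/tabby        # assumed image; adjust to the actual setup
    network_mode: "host"        # compose spelling of the "network: host" option
    volumes:
      - ./tabby-data:/data      # assumed data mount
```

With the containers running, reachability can be verified from inside the container, e.g. `docker exec -it tabby curl http://localhost:11434`, which returns a short "Ollama is running" message when the server is reachable.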
-
Hello, thank you for your response. With your advice, the Docker container is up and running, and I checked it. However, I am encountering another issue now: Tabby is not suggesting any code. The web server is up, and it is connected to the VSCode extension. When I try this command:

Eventually I get:

When I looked at Ollama's logs:

```
✔ Container ollama  Created  0.0s
```

A while later it also gives some warnings about CUDA:

```
ollama | cuda driver library failed to get device context 800
time=2024-07-31T07:56:14.718Z level=WARN source=gpu.go:399 msg="error looking up nvidia GPU memory"
```

**Images**

**Request**

If you have any suggestions on how to resolve this, I would greatly appreciate it. Thank you.
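Those warnings suggest the Ollama container cannot see the NVIDIA GPU. A sketch of granting the container GPU access in docker-compose, assuming the NVIDIA Container Toolkit is installed on the host; the service definition below is illustrative, not the poster's actual file:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # or a specific number of GPUs
              capabilities: [gpu]
```

Without GPU access, Ollama falls back to CPU inference, which matches the warnings quoted above and makes completions noticeably slower.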