/var/log

Using llama.cpp with VS Code

Download a model file from Hugging Face:

llama-cli \
  --hf-repo unsloth/Qwen3.5-9B-GGUF \
  --hf-file Qwen3.5-9B-UD-Q8_K_XL.gguf \
  -ngl 99 \
  -c 32768 \
  --cache-type-k bf16 \
  --cache-type-v bf16

This means:

  • --hf-repo - Hugging Face repository, in the form [author]/[repo_name]
  • --hf-file - Filename of the model inside that repository
  • -ngl 99 - Number of GPU layers: asks llama.cpp to offload up to 99 layers, which is more than the model has, so all 32 layers of the 9B model are loaded into your Mac's GPU memory.
  • -c 32768 - Context size, in this case 32k tokens. At this size the KV cache is small enough to keep in bf16.
  • --cache-type-k bf16 / --cache-type-v bf16 - Stores the keys and values of the KV cache in bf16. This prevents text generation from degrading into unreadable gibberish when the model processes extended multi-file coding workflows.

If you want to run this model with more context, you have to store the KV cache in a smaller data type, for example q8_0:

llama-cli -m Qwen3.5-9B-UD-Q8_K_XL.gguf \
  -c 131072 \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -t 8

Here -t 8 sets the number of CPU threads. This way it still runs on a MacBook Pro M1 with 32 GB of RAM.
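To see why a longer context calls for a smaller cache type, here is a rough back-of-the-envelope sketch. It assumes a plain transformer where every layer keeps keys and values for every token. The layer count, KV head count, and head dimension below are hypothetical placeholders, not values read from the Qwen3.5-9B file, so treat the results as an illustration of the scaling rather than the real memory use.

```shell
# Estimate KV cache size in MiB.
# Args: n_layers n_kv_heads head_dim n_ctx bytes_num bytes_den
# bytes_num/bytes_den is the storage cost per cached element:
#   bf16 -> 2/1
#   q8_0 -> 34/32 (a block of 32 int8 values plus one 16-bit scale)
kv_cache_mib() {
  # The factor 2 covers keys and values.
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / $6 / 1048576 ))
}

# Hypothetical architecture: 32 layers, 8 KV heads, head dimension 128.
kv_cache_mib 32 8 128 32768 2 1      # 32k context, bf16
kv_cache_mib 32 8 128 131072 2 1     # 128k context, bf16
kv_cache_mib 32 8 128 131072 34 32   # 128k context, q8_0
```

With these placeholder numbers the three calls print 4096, 16384, and 8704. Quadrupling the context quadruples the cache, and switching from bf16 to q8_0 roughly halves it again.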

Note that if you later want to run the same model, just use the same command (including the --hf-repo and --hf-file parameters). Instead of re-downloading the model, llama-cli will use the copy in the cache directory (on Mac, ~/.cache/huggingface/hub by default). The downloaded files are stored under ~/.cache/huggingface/hub/models--[author]--[repo_name]/snapshots/[commit_hash]/[your_model].gguf.
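A quick way to check what is already in the cache is to search it for GGUF files. The helper below is a small sketch; the function name list_cached_gguf is made up for this example, and the default path is the one mentioned above.

```shell
# List every GGUF file under a Hugging Face hub cache directory.
# Defaults to the path named in the text; pass another directory to override.
list_cached_gguf() {
  cache_dir="${1:-$HOME/.cache/huggingface/hub}"
  # No "-type f" here: entries under snapshots/ may be symlinks into blobs/.
  find "$cache_dir" -name '*.gguf' 2>/dev/null | sort
}
```

Calling list_cached_gguf without arguments prints the full path of each cached model, which you can then pass to llama-cli with -m.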

Now in VS Code, open the language model settings, choose something like "Manage Language Models", and then choose "Add a new model". This opens a JSON file where you can enter the details of the custom local model. An example is below:

[
    {
        "name": "http://localhost:8080",
        "vendor": "customendpoint",
        "apiKey": "${input:chat.lm.secret.3d115400}",
        "apiType": "chat-completions",
        "models": [
            {
                "id": "unsloth/Qwen3.5-9B-GGUF:Q8_K_XL",
                "name": "Qwen3.5-9B",
                "url": "http://localhost:8080",
                "toolCalling": true,
                "vision": true,
                "maxInputTokens": 128000,
                "maxOutputTokens": 16000
            }
        ]
    }
]

Make sure to start llama-server first, before using this model from within VS Code! The endpoint at http://localhost:8080 is served by llama-server, not by llama-cli.
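The commands earlier on this page use llama-cli, so here is a sketch of what starting the server could look like. It assumes llama-server accepts the same model and cache flags as the llama-cli call, and that port 8080 matches the "url" in the JSON above. The function names start_llama_server and server_ready are made up for this example.

```shell
# Port is an assumption: it has to match the "url" in the VS Code JSON.
PORT=8080

# Start the HTTP server with the same model and cache flags as llama-cli.
start_llama_server() {
  llama-server \
    --hf-repo unsloth/Qwen3.5-9B-GGUF \
    --hf-file Qwen3.5-9B-UD-Q8_K_XL.gguf \
    -ngl 99 \
    -c 32768 \
    --cache-type-k bf16 \
    --cache-type-v bf16 \
    --port "$PORT"
}

# Succeeds only once the server answers on its /health endpoint.
server_ready() {
  curl -sf "http://localhost:${1:-$PORT}/health" >/dev/null 2>&1
}
```

A typical session would run start_llama_server in one terminal, and in another wait with "until server_ready; do sleep 1; done" before switching to VS Code. Loading a model of this size can take a while, so the health check may fail for the first few seconds.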
