Download a model file from Hugging Face and run it:
llama-cli \
  --hf-repo unsloth/Qwen3.5-9B-GGUF \
  --hf-file Qwen3.5-9B-UD-Q8_K_XL.gguf \
  -ngl 99 \
  -c 32768 \
  --cache-type-k bf16 \
  --cache-type-v bf16
This means:
- --hf-repo - the Hugging Face repository, given as author/repo_name rather than a full URL
- --hf-file - the model filename inside that repository
- -ngl 99 - number of GPU layers: offloads up to 99 layers, which forces llama.cpp to load all 32 layers of the 9B model into your Mac's GPU memory.
- -c 32768 - context size, in this case 32k tokens. At this size the KV cache is small enough to be kept in bf16.
- --cache-type-k bf16 / --cache-type-v bf16 - stores the KV cache keys and values in bf16, which prevents text generation from degrading into unreadable gibberish when the model processes extended multi-file coding workflows.
If you want to run this model with more context, you have to switch the KV cache to a smaller data type, for example q8_0:
llama-cli -m Qwen3.5-9B-UD-Q8_K_XL.gguf \
  -c 131072 \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -t 8
Here -t 8 sets the number of CPU threads. This way it still runs on a MacBook Pro M1 with 32GB of RAM.
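To get a feel for why a quantized cache leaves room for more context, here is a rough back-of-the-envelope sketch. It assumes the textbook formula for a plain attention KV cache (keys plus values, per layer, per KV head, per token) and uses purely hypothetical values for the KV head count and head dimension. These are not taken from the Qwen3.5 model card, and the real architecture may cache less, so treat the output as an illustration of the ratio rather than a measurement. The byte sizes are 2 bytes per value for bf16 and 34 bytes per 32 values for q8_0.

```shell
# Rough KV cache size estimate in bytes (a sketch, not what llama.cpp reports).
# Arguments: layers, KV heads, head dimension, context size,
#            bytes-per-value numerator, bytes-per-value denominator.
kv_cache_bytes() {
  n_layers=$1
  n_kv_heads=$2
  head_dim=$3
  n_ctx=$4
  bytes_num=$5
  bytes_den=$6
  # The leading 2 accounts for storing both keys and values.
  echo $(( 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_num / bytes_den ))
}

# Hypothetical shape: 32 layers, 8 KV heads, head dimension 128.
kv_cache_bytes 32 8 128 32768 2 1      # bf16 at 32k context
kv_cache_bytes 32 8 128 131072 34 32   # q8_0 at 128k context
```

Under these assumed numbers, four times the context costs only a little over twice the cache memory once the cache is stored as q8_0.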
Note that if you later want to run the same model, just use the same command (including the --hf-repo and --hf-file parameters). Instead of re-downloading the model, llama-cli will reuse the copy in the cache directory (by default ~/.cache/huggingface/hub on a Mac). The downloaded files are stored under ~/.cache/huggingface/hub/models--[author]--[repo_name]/snapshots/[commit_hash]/[your_model].gguf.
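If you want to check what is already cached, a small sketch like the following may help. It only relies on the directory layout described above; hf_repo_cache_dir is a hypothetical helper name introduced here for illustration, and it assumes the default cache location has not been changed.

```shell
# Print the cache directory for a repo given as author/repo_name.
# Assumes the default layout: ~/.cache/huggingface/hub/models--author--repo_name
hf_repo_cache_dir() {
  repo_dir=$(printf '%s' "$1" | sed 's|/|--|g')
  printf '%s\n' "$HOME/.cache/huggingface/hub/models--$repo_dir"
}

# List cached GGUF files for the repo used above, if the directory exists.
snapshots="$(hf_repo_cache_dir unsloth/Qwen3.5-9B-GGUF)/snapshots"
if [ -d "$snapshots" ]; then
  find "$snapshots" -name '*.gguf'
fi
```

If the listing is empty, the model has not been downloaded yet or the cache lives somewhere else on your machine.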
Now in VS Code, open the model settings, choose something like "Manage Language Models", and then choose "Add a new model". This opens a JSON file where you can enter the details of this custom local model. An example is below:
[
  {
    "name": "http://localhost:8080",
    "vendor": "customendpoint",
    "apiKey": "${input:chat.lm.secret.3d115400}",
    "apiType": "chat-completions",
    "models": [
      {
        "id": "unsloth/Qwen3.5-9B-GGUF:Q8_K_XL",
        "name": "Qwen3.5-9B",
        "url": "http://localhost:8080",
        "toolCalling": true,
        "vision": true,
        "maxInputTokens": 128000,
        "maxOutputTokens": 16000
      }
    ]
  }
]
The endpoint at http://localhost:8080 is served by llama-server, so make sure to start it before using this model from within VS Code!
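The commands earlier use llama-cli, which does not expose an HTTP endpoint. A minimal sketch of starting the server with the same model might look like this. It assumes llama-server accepts the same model and cache options as llama-cli and that its /health endpoint is used to detect readiness; start_llama_server and wait_for_llama_server are hypothetical helper names, and curl is assumed to be installed.

```shell
# Start llama-server with the same settings as the first llama-cli example.
# Port 8080 matches the "url" in the VS Code configuration.
start_llama_server() {
  llama-server \
    --hf-repo unsloth/Qwen3.5-9B-GGUF \
    --hf-file Qwen3.5-9B-UD-Q8_K_XL.gguf \
    -ngl 99 \
    -c 32768 \
    --cache-type-k bf16 \
    --cache-type-v bf16 \
    --port 8080
}

# Poll the health endpoint until the server answers or the attempts run out.
# Arguments: base URL (default http://localhost:8080), attempts (default 60).
wait_for_llama_server() {
  base_url="${1:-http://localhost:8080}"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

One way to use it is to run start_llama_server in one terminal and, in another, run wait_for_llama_server && echo "server is ready" before switching to VS Code. Note that the context size given to the server caps what the model can actually accept, whatever maxInputTokens says in the JSON.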