Overview
Unsloth provides a streamlined method to export fine-tuned models directly to GGUF format. It features "Dynamic 2.0" quantization, which protects sensitive weights to maintain high accuracy, and automates the merging of LoRA adapters.
When to Use
- •When deploying models to local serving platforms like Ollama, llama.cpp, or LM Studio.
- •When model size needs to be minimized for CPU-based inference or low-VRAM GPUs.
- •When sharing models with the community via GGUF format.
Decision Tree
- •Is target VRAM very low?
- •Yes: Use
quantization_method = 'q4_k_m'or higher compression. - •No: Use
q8_0orf16for maximum quality.
- •Yes: Use
- •Deploying to Ollama?
- •Yes: Export to GGUF and then create a
Modelfilewith aFROMcommand.
- •Yes: Export to GGUF and then create a
Workflows
Exporting Fine-tuned Models to GGUF
- •After training, call
model.save_pretrained_gguf("name", tokenizer, quantization_method='q4_k_m'). - •Specify quantization method (e.g.,
q4_k_m,q8_0,f16) based on target VRAM. - •Wait for the script to download llama.cpp and perform conversion automatically.
Deploying to Ollama
- •Export model to GGUF using the native Unsloth save function.
- •Create a 'Modelfile' containing:
FROM ./model-q4_k_m.gguf. - •Run
ollama create my-model -f Modelfileto import and serve.
Non-Obvious Insights
- •Unsloth 'Dynamic 2.0' GGUFs are superior to standard GGUFs because they dynamically identify and protect weights that are sensitive to quantization, leading to higher MMLU scores.
- •The GGUF export process handles the complex task of merging LoRA layers back into the base weights automatically, ensuring the resulting file is a standalone model.
- •Unsloth supports direct Hub uploading for GGUFs, removing the need for local storage during the export-to-share pipeline.
Evidence
- •"model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")" Source
- •"Unsloth Dynamic 4-bit Quantization! We dynamically opt not to quantize certain parameters and this greatly increases accuracy." Source
Scripts
- •
scripts/unsloth-gguf_tool.py: Python helper for automated GGUF export. - •
scripts/unsloth-gguf_tool.js: Utility to generate Ollama Modelfiles.
Dependencies
- •unsloth
- •llama-cpp-python (or local llama.cpp binary)
- •huggingface_hub
References
- •[[references/README.md]]