Name: unsloth-inference
Rating: 88
Author: cuba6112

Overview

Unsloth models can be deployed using native optimized inference or through production serving engines like vLLM and SGLang. Native inference is accelerated 2x via for_inference(), while production serving requires merging LoRA weights into the base model.

When to Use

•When performing local testing or simple application deployment.
•When building high-throughput production endpoints using vLLM or SGLang.
•When creating OpenAI-compatible APIs for drop-in replacement in existing apps.

Decision Tree

•
Is throughput the priority?
- •Yes: Merge to 16-bit and use vLLM.
•
Is VRAM for serving extremely limited?
- •Yes: Merge to 4-bit or use GGUF via Ollama.
•
Running simple Python inference?
- •Yes: Call FastLanguageModel.for_inference(model).

Workflows

Native Optimized Inference

•Load fine-tuned model and tokenizer using FastLanguageModel.
•Call FastLanguageModel.for_inference(model) to enable optimized kernels.
•Use model.generate() with inputs formatted via the chat template.

Merging LoRA for vLLM Serving

•Select export method: 'merged_16bit' (quality) or 'merged_4bit' (VRAM).
•Run model.save_pretrained_merged("serving_model", tokenizer, save_method='merged_16bit').
•Start vLLM server pointing to the 'serving_model' directory.

Non-Obvious Insights

•Calling FastLanguageModel.for_inference(model) is mandatory for speed; it performs on-the-fly weight fusion and enables specialized Triton kernels for the forward pass.
•Most production serving engines (vLLM, SGLang) cannot use LoRA adapters directly without performance hits; merging them into the base model during export is the standard best practice.
•The unsloth-cli provides a pre-built OpenAI-compatible endpoint, making it easy to serve models locally and test with standard API clients.

Evidence

•"Unsloth itself provides 2x faster inference natively as well, so always do not forget to call FastLanguageModel.for_inference(model)." Source
•"model.save_pretrained_merged("output", tokenizer, save_method = "merged_16bit")" Source

Scripts

•scripts/unsloth-inference_tool.py: Script for running local optimized inference.
•scripts/unsloth-inference_tool.js: Helper to test OpenAI-compatible endpoints.

Dependencies

•unsloth
•torch
•vllm (optional for production)
•sglang (optional for production)

References

•[[references/README.md]]