Fireworks AI Skill
Fast, cost-effective access to 100+ open-source models with OpenAI-compatible APIs, LoRA fine-tuning, and advanced deployment options.
When to Use This Skill
| Scenario | Example | Relevant Section |
|---|---|---|
| Query text models | "Chat completion with Llama" | Quick Reference → Chat Completion |
| Fine-tune a model | "Train model on my data" | Fine-Tuning Overview |
| Deploy custom model | "On-demand GPU deployment" | Deployments |
| Migrate from OpenAI | "Use OpenAI SDK with Fireworks" | OpenAI Compatibility |
| Batch processing | "Process 10K prompts offline" | Batch Inference |
| Image generation | "FLUX Kontext image editing" | Image Generation |
| Embeddings/RAG | "Generate embeddings for search" | Embeddings & Reranking |
| CLI operations | "firectl commands" | firectl Reference |
Quick Reference
Chat Completion (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<YOUR_FIREWORKS_API_KEY>",
)
chat_completion = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
print(chat_completion.choices[0].message.content)
Chat Completion (curl)
curl --request POST \
  --url https://api.fireworks.ai/inference/v1/chat/completions \
  --header "accept: application/json" \
  --header "authorization: Bearer $FIREWORKS_API_KEY" \
  --header "content-type: application/json" \
  --data '{
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Supervised Fine-Tuning Job
firectl supervised-fine-tuning-job create \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset my-training-dataset \
  --output-model my-fine-tuned-model \
  --epochs 3 \
  --learning-rate 1e-4 \
  --lora-rank 8
Create Dataset for Fine-Tuning
from fireworks.client import Dataset
dataset = Dataset.from_file(
    "path/to/training_data.jsonl",
    name="my-training-dataset",
)
# Dataset is now available on Fireworks for fine-tuning
Monitor Training Progress
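import time
# `job` is the fine-tuning job object obtained when the job was created
while not job.is_completed:
    job.raise_if_bad_state()  # stop early if the job entered a bad state
    print(f"Training state: {job.state}")
    time.sleep(10)
    job = job.get()  # refresh the job status
print(f"Training completed! New model: {job.output_model}")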
Deploy Fine-Tuned Model (Multi-LoRA)
from fireworks import LLM
base_model = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    id="shared-base-deployment",
    enable_addons=True,
)
Generate Embeddings
from openai import OpenAI
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<YOUR_FIREWORKS_API_KEY>",
)
response = client.embeddings.create(
    model="fireworks/qwen3-embedding-8b",
    input="Your text to embed",
)
embeddings = response.data[0].embedding
Export Billing Metrics
firectl billing export-metrics \
  --start-time "2025-01-01" \
  --end-time "2025-01-31" \
  --filename january_metrics.csv
Create Deployment
firectl deployment create accounts/fireworks/models/deepseek-v3 \
  --deployment-shape throughput
Key Concepts
Fine-Tuning Methods
| Method | Use Case | When to Use |
|---|---|---|
| SFT (Supervised) | Classification, extraction | Large labeled dataset (~1000+ examples) |
| RFT (Reinforcement) | Complex reasoning, agents | Small dataset, verifiable outputs, multi-step tasks |
| DPO (Preference) | Alignment, style | Pairwise preference comparisons |
Decision Tree:
- Have 1000+ labeled examples? → SFT
- Task is verifiable but lacks golden outputs? → RFT
- Want to align with preferences? → DPO
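The decision tree above can be sketched as a small helper. This is only an illustration: the function name, its parameters, and the choice to return None when no branch applies are assumptions, not part of any Fireworks SDK.

```python
from typing import Optional


def choose_fine_tuning_method(
    labeled_examples: int,
    outputs_verifiable: bool = False,
    has_preference_pairs: bool = False,
) -> Optional[str]:
    """Return "SFT", "RFT", or "DPO" by walking the decision tree in order.

    Hypothetical helper: returns None when none of the three branches applies.
    """
    # Branch 1: a large labeled dataset points to supervised fine-tuning.
    if labeled_examples >= 1000:
        return "SFT"
    # Branch 2: verifiable task without golden outputs points to reinforcement fine-tuning.
    if outputs_verifiable:
        return "RFT"
    # Branch 3: pairwise preference data points to DPO.
    if has_preference_pairs:
        return "DPO"
    return None


print(choose_fine_tuning_method(5000))
print(choose_fine_tuning_method(50, outputs_verifiable=True))
```

Treat the result as a starting point; the table above lists the use cases each method fits best.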
LoRA (Low-Rank Adaptation)
Fireworks uses LoRA for efficient fine-tuning:
- Faster & cheaper - Train in hours, not days
- Easy to deploy - Instant deployment on Fireworks
- Flexible - Run multiple LoRAs on a single base deployment
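To see why LoRA training is cheaper, it helps to count trainable parameters. The sketch below uses the general LoRA formulation (two low-rank matrices per adapted weight matrix), not anything specific to Fireworks; the 4096 x 4096 layer size is an illustrative assumption, and rank 8 matches the --lora-rank value used in the fine-tuning command above.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_in * d_out


# Illustrative layer size (assumption), with rank 8.
d_in, d_out, rank = 4096, 4096, 8
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA params: {lora}, full params: {full}, fraction: {lora / full:.2%}")
# prints: LoRA params: 65536, full params: 16777216, fraction: 0.39%
```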
Deployment Types
| Type | Use Case | Scaling |
|---|---|---|
| Serverless | Variable traffic, cost optimization | Auto-scale to zero |
| On-Demand | Consistent performance, high throughput | Dedicated GPUs |
| Reserved | Predictable workloads, discounts | Pre-purchased capacity |
Agent Tracing (RFT)
For reinforcement fine-tuning with agents:
- Use model_base_url from the trainer (points to tracing.fireworks.ai)
- Attach FireworksTracingHttpHandler for structured logging
- Log Status.rollout_finished() or Status.rollout_error() on completion
- The trainer joins traces and logs via rollout_id
API Compatibility
Fireworks is OpenAI-compatible. Key differences:
| Feature | OpenAI | Fireworks |
|---|---|---|
| max_tokens overflow | Error | Auto-truncate (configurable) |
| Streaming usage stats | Not returned by default | Returned in final chunk |
| Model names | gpt-4 | accounts/fireworks/models/llama-v3p1-8b-instruct |
Set context_length_exceeded_behavior: "error" for OpenAI-like behavior.
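One way to send this setting from the OpenAI SDK is the extra_body argument, which adds fields to the request body that the SDK does not model itself. The sketch below only builds the keyword arguments; the helper name is hypothetical, and it assumes Fireworks reads context_length_exceeded_behavior as a top-level request field.

```python
from typing import Any, Dict, List


def chat_kwargs(
    model: str,
    messages: List[Dict[str, str]],
    strict_context: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create().

    Hypothetical helper. With strict_context=True the request asks for an
    error instead of truncation when the context length is exceeded.
    """
    kwargs: Dict[str, Any] = {"model": model, "messages": messages}
    if strict_context:
        # Assumption: the field is accepted at the top level of the request body.
        kwargs["extra_body"] = {"context_length_exceeded_behavior": "error"}
    return kwargs


kwargs = chat_kwargs(
    "accounts/fireworks/models/llama-v3p1-8b-instruct",
    [{"role": "user", "content": "Hello!"}],
)
print(kwargs["extra_body"])
```

The result would then be passed as client.chat.completions.create(**kwargs), using the client from the Quick Reference.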
firectl CLI Quick Reference
# Authentication
firectl login
# Account operations
firectl account list
# Dataset operations
firectl dataset download <dataset-id>
firectl dataset list
# Fine-tuning jobs
firectl supervised-fine-tuning-job create --help
firectl supervised-fine-tuning-job list
firectl dpo-job resume <job-id>
# Deployments
firectl deployment create <model> --deployment-shape <shape>
firectl deployment scale <deployment-id> --replicas <n>
# Evaluators
firectl evaluator-revision get <evaluator-id>
# Billing
firectl billing export-metrics
Available Models (Highlights)
Text Models:
- DeepSeek V3, DeepSeek R1
- Llama 3.1/3.2/3.3 (8B, 70B, 405B)
- Qwen 2.5 family
- Kimi K2
Embedding Models:
- fireworks/qwen3-embedding-8b (serverless)
- fireworks/qwen3-embedding-4b
- nomic-ai/nomic-embed-text-v1.5
Reranking Models:
- fireworks/qwen3-reranker-8b (serverless)
Image Models:
- FLUX Kontext Pro/Max
- SDXL ControlNet
Browse all: https://fireworks.ai/models
Reference Files
| File | Content | Use For |
|---|---|---|
| references/llms-txt.md | Complete API reference (410 pages) | Detailed API docs, all CLI commands, parameters |
Navigation tips:
- Search for specific CLI commands: firectl <command>
- API endpoints follow the pattern: /v1/accounts/{account_id}/<resource>
- Fine-tuning docs are under #fine-tuning-* sections
- Deployment docs are under #deployment-* sections
Working with This Skill
For Beginners
- Start with the Chat Completion example above
- Get an API key from https://app.fireworks.ai
- Use the OpenAI SDK (familiar interface)
- Try serverless models first (no deployment needed)
For Fine-Tuning
- Prepare a JSONL dataset in messages format
- Upload with Dataset.from_file() or firectl
- Choose a fine-tuning method (SFT/RFT/DPO)
- Monitor with firectl supervised-fine-tuning-job list
- Deploy the LoRA or merge it into the base model
For Production
- Consider on-demand deployments for consistent performance
- Enable prompt caching for repeated prefixes
- Use batch inference for offline processing
- Monitor usage via billing export or dashboard
- Set up service accounts for CI/CD
Common Patterns
Streaming with Usage Stats
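stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}], stream=True)
for chunk in stream:
    if chunk.usage:  # Available in final chunk
        print(f"Tokens: {chunk.usage.total_tokens}")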
Variable-Length Embeddings
response = client.embeddings.create(
    model="fireworks/qwen3-embedding-8b",
    input="Your text",
    dimensions=128,  # Reduce from default for faster similarity
)
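The speed benefit comes from the similarity computation itself: each comparison touches every dimension once, so shorter vectors mean proportionally less work per pair. Below is a minimal pure-Python cosine similarity for illustration; the toy vectors stand in for real embeddings, and a production system would more likely use a vector library or database.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (cost grows with len(a))."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for embeddings (illustrative only).
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]
orthogonal = [3.0, 0.0, -1.0]
print(round(cosine_similarity(query, same_direction), 6))
print(round(cosine_similarity(query, orthogonal), 6))
```

Vectors being compared must come from requests with the same dimensions value, since the function rejects mismatched lengths.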
Reranking Documents
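# Using the /rerank endpoint through the OpenAI client's generic post() helper
import httpx
response = client.post(
    "/rerank",
    cast_to=httpx.Response,  # return the raw HTTP response
    body={
        "model": "fireworks/qwen3-reranker-8b",
        "query": "search query",
        "documents": ["doc1", "doc2", "doc3"],
    },
)
print(response.json())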
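Once scores come back, the documents need to be reordered. The sketch below assumes the response JSON has a "data" list whose items carry an "index" into the submitted documents and a "relevance_score"; that shape is an assumption to verify against the API reference, and the helper name is hypothetical.

```python
from typing import Dict, List, Tuple


def order_by_relevance(documents: List[str], rerank_response: Dict) -> List[Tuple[str, float]]:
    """Pair each document with its score and sort from most to least relevant.

    Assumed response shape: {"data": [{"index": int, "relevance_score": float}, ...]}.
    """
    results = rerank_response["data"]
    ranked = sorted(results, key=lambda item: item["relevance_score"], reverse=True)
    return [(documents[item["index"]], item["relevance_score"]) for item in ranked]


# Hand-written stand-in for a parsed response (not real API output).
documents = ["doc1", "doc2", "doc3"]
fake_response = {
    "data": [
        {"index": 0, "relevance_score": 0.12},
        {"index": 1, "relevance_score": 0.87},
        {"index": 2, "relevance_score": 0.45},
    ]
}
for doc, score in order_by_relevance(documents, fake_response):
    print(f"{score:.2f}  {doc}")
```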
Resources
- Model Library: https://fireworks.ai/models
- Playground: https://app.fireworks.ai/playground
- Usage Dashboard: https://app.fireworks.ai/account/usage
- API Reference: https://docs.fireworks.ai/api-reference
- firectl Docs: https://docs.fireworks.ai/tools-sdks/firectl
Notes
- Generated from official Fireworks AI documentation (410 pages)
- OpenAI SDK examples work directly with Fireworks
- Model names use the accounts/fireworks/models/<model-name> format
- Fine-tuning uses LoRA by default (set --lora-rank 0 for full-parameter tuning)
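The model name format can be built and checked with a pair of small helpers. This is a sketch with hypothetical function names; the account parameter, which lets the "fireworks" segment vary for models under another account, is an assumption.

```python
from typing import Tuple


def model_path(model_name: str, account: str = "fireworks") -> str:
    """Build a full model identifier such as accounts/fireworks/models/<model-name>."""
    if not model_name or "/" in model_name:
        raise ValueError("model_name must be a bare name without slashes")
    return f"accounts/{account}/models/{model_name}"


def split_model_path(path: str) -> Tuple[str, str]:
    """Return (account, model_name) from a full identifier, or raise ValueError."""
    parts = path.split("/")
    if len(parts) != 4 or parts[0] != "accounts" or parts[2] != "models":
        raise ValueError(f"not in accounts/<account>/models/<model-name> form: {path!r}")
    return parts[1], parts[3]


print(model_path("llama-v3p1-8b-instruct"))
print(split_model_path("accounts/fireworks/models/deepseek-v3"))
```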