ReinforceNow Configuration
This guide covers config.yml and train.jsonl setup for RL, SFT, and Distillation training.
Project Structure
my_project/ ├── config.yml # Training configuration (required) ├── train.jsonl # Training data (required) ├── rewards.py # Reward functions (required for RL, not needed for SFT/Distillation) ├── tools.py # Tool definitions (optional, RL only) └── requirements.txt # Python dependencies (optional)
config.yml
Minimal RL Config
project_name: "My RL Project" dataset_type: rl data: train_file: train.jsonl batch_size: 4 group_size: 8 model: path: Qwen/Qwen3-8B trainer: num_epochs: 10 learning_rate: 0.0001
Minimal SFT Config
project_name: "My SFT Project" dataset_type: sft data: train_file: train.jsonl batch_size: 4 val_split: 0.2 model: path: Qwen/Qwen3-8B trainer: num_epochs: 10 learning_rate: 0.0001
Minimal Distillation Config
On-policy distillation trains a student model to match a teacher model's behavior. The student generates, the teacher grades each token, and KL divergence provides supervision.
project_name: "My Distillation Project" dataset_type: distill data: train_file: train.jsonl batch_size: 8 group_size: 4 model: path: Qwen/Qwen3-8B # Student model teacher: path: Qwen/Qwen3-32B # Teacher model (larger) rollout: max_context_window: 8192 trainer: num_epochs: 3 learning_rate: 0.0001
Key points:
- •No
rewards.pyneeded - teacher provides all supervision via KL penalty - •Student generates on its own distribution (on-policy)
- •Teacher computes log probabilities for each token
- •KL penalty coefficient is 1.0 (full weight on teacher supervision)
Full RL Config (All Options)
# Project identification (auto-filled on first run) project_id: "" project_name: "My RL Project" dataset_id: "" dataset_type: rl organization_id: "" description: "Training description" # Data configuration data: train_file: train.jsonl # Path to training data batch_size: 16 # 1-32, prompts per batch group_size: 4 # 1-64, rollouts per prompt (RL only) # NOTE: batch_size * group_size <= 2048 # Model configuration model: path: Qwen/Qwen3-8B # Model name or checkpoint ID qlora_rank: 32 # LoRA rank (model-specific max) qlora_alpha: 64 # LoRA alpha (default: rank * 2) name: "custom-model-name" # Optional output name description: "Model desc" # Optional description # RL algorithm (RL only) algorithm: loss_fn: ppo # 'ppo' or 'importance_sampling' adv_estimator: grpo # 'grpo', 'gae', or 'reinforce' kl_penalty_coef: 0.01 # KL divergence penalty # Rollout configuration (RL only) rollout: max_turns: 1 # Max conversation turns max_context_window: 2048 # Max tokens per generation termination_policy: last_tool # 'last_tool' or 'max_turns' thinking_mode: null # null, 'disabled', 'easy', 'medium', 'hard' mcp_url: null # MCP server URL(s) tool_timeout: 60 # Tool execution timeout max_context_window: 32768 # Max context window in tokens (tool results auto-truncated) include_thinking: false # Include <think> in history # Training configuration trainer: num_epochs: 30 # Number of epochs learning_rate: 0.0001 # Learning rate save_step: 20 # Save checkpoint every N steps (0 = end of epoch) eval_step: 0 # Evaluate every N steps (0 = end of epoch)
Configuration Sections
data
| Field | Required | Default | Description |
|---|---|---|---|
train_file | Yes | train.jsonl | Path to training data |
batch_size | Yes | - | Prompts per batch (1-32) |
group_size | RL only | 4 | Rollouts per prompt (1-64) |
val_split | SFT only | 0.0 | Validation split ratio (0.0-1.0) |
Important: batch_size * group_size must be <= 2048 (concurrency limit).
model
| Field | Required | Default | Description |
|---|---|---|---|
path | Yes | - | Model name or checkpoint ID |
qlora_rank | No | 32 | LoRA rank for efficient finetuning |
qlora_alpha | No | rank * 2 | LoRA alpha scaling |
Supported Models
Qwen (Text)
- •
Qwen/Qwen3-8B(max rank: 128) - •
Qwen/Qwen3-4B-Instruct-2507(max rank: 128) - •
Qwen/Qwen3-30B-A3B(max rank: 64) - •
Qwen/Qwen3-30B-A3B-Instruct-2507(max rank: 64) - •
Qwen/Qwen3-32B(max rank: 128) - •
Qwen/Qwen3-235B-A22B-Instruct-2507(max rank: 64)
Qwen (Vision)
- •
Qwen/Qwen3-VL-30B-A3B-Instruct - •
Qwen/Qwen3-VL-235B-A22B-Instruct
Meta Llama (max rank: 128)
- •
meta-llama/Llama-3.3-70B-Instruct - •
meta-llama/Llama-3.1-70B - •
meta-llama/Llama-3.1-8B - •
meta-llama/Llama-3.1-8B-Instruct - •
meta-llama/Llama-3.2-3B - •
meta-llama/Llama-3.2-1B
DeepSeek (max rank: 64)
- •
deepseek-ai/DeepSeek-V3.1 - •
deepseek-ai/DeepSeek-V3.1-Base
OpenAI (max rank: 32)
- •
openai/gpt-oss-120b - •
openai/gpt-oss-20b
Moonshot
- •
moonshotai/Kimi-K2-Thinking
Multi-Model Training
Train multiple models with the same config:
model:
path:
- Qwen/Qwen3-4B-Instruct-2507
- Qwen/Qwen3-8B
- Qwen/Qwen3-30B-A3B
qlora_rank: 32
The CLI submits separate runs for each model.
teacher (Distillation only)
| Field | Required | Description |
|---|---|---|
path | Yes | Teacher model name (must be a supported model) |
The teacher provides supervision via reverse KL divergence. Use a larger/more capable model as teacher:
teacher: path: Qwen/Qwen3-32B # 32B teacher distilling to 8B student
Teacher selection tips:
- •Use a model from the same family (e.g., Qwen teacher for Qwen student)
- •Larger teachers generally produce better students
- •Teacher must be a supported model (see model list above)
algorithm (RL only)
| Field | Default | Options |
|---|---|---|
loss_fn | ppo | ppo, importance_sampling |
adv_estimator | grpo | grpo, gae, reinforce |
kl_penalty_coef | 0.01 | KL divergence penalty weight |
Recommendations:
- •Default
ppo+grpoworks well for most tasks - •Lower
kl_penalty_coef(0.001) for more exploration - •Higher
kl_penalty_coef(0.1) for stability
rollout (RL and Distillation)
| Field | Default | Description |
|---|---|---|
max_turns | 1 | Max conversation turns |
max_context_window | 2048 | Max tokens per generation |
termination_policy | last_tool | When to end episode |
thinking_mode | null | Chain-of-thought mode |
mcp_url | null | MCP server URL(s) |
tool_timeout | 60 | Tool execution timeout |
max_context_window | 32768 | Max context window in tokens |
include_thinking | false | Keep <think> in history |
Termination Policies
| Policy | Behavior |
|---|---|
last_tool | Episode ends when model responds without tool call |
max_turns | Episode always runs for exactly max_turns |
Thinking Mode (Reasoning Models)
For models that support chain-of-thought (<think> tags):
| Mode | Description |
|---|---|
null | Auto-enable for supported models |
disabled | Explicitly disable reasoning |
easy | Light reasoning |
medium | Moderate reasoning |
hard | Deep reasoning (more tokens) |
Important: Reasoning models need higher max_context_window (8192-16384).
MCP Configuration
Connect to external MCP servers for tools:
# Single server
rollout:
mcp_url: "https://mcp.tavily.com/mcp/?tavilyApiKey=YOUR_KEY"
# Multiple servers
rollout:
mcp_url:
- "https://mcp.tavily.com/mcp/?tavilyApiKey=..."
- "https://mcp.exa.ai/mcp/?apiKey=..."
# In-sandbox MCP (requires docker in train.jsonl)
rollout:
mcp_url: localhost:8931
trainer
| Field | Required | Default | Description |
|---|---|---|---|
num_epochs | Yes | - | Number of training epochs |
learning_rate | Yes | - | Learning rate |
save_step | No | 0 | Save every N steps (0 = end of epoch) |
eval_step | No | 0 | Evaluate every N steps (0 = end of epoch) |
train.jsonl Format
For full train.jsonl documentation including message format, sandbox/docker configuration, and examples, see the rnow-train-jsonl skill.
Converting HuggingFace Datasets
This section shows how to convert HuggingFace datasets to train.jsonl format.
SFT Conversion
For SFT, include both user and assistant messages:
from datasets import load_dataset
import json
dataset = load_dataset("your-dataset-name", split="train")
with open("train.jsonl", "w") as f:
for row in dataset:
entry = {
"messages": [
{"role": "user", "content": row["question"]},
{"role": "assistant", "content": row["answer"]}
]
}
f.write(json.dumps(entry) + "\n")
Alpaca-style Dataset
from datasets import load_dataset
import json
dataset = load_dataset("tatsu-lab/alpaca", split="train")
with open("train.jsonl", "w") as f:
for row in dataset:
if row.get("input"):
user_content = f"{row['instruction']}\n\nInput: {row['input']}"
else:
user_content = row["instruction"]
entry = {
"messages": [
{"role": "user", "content": user_content},
{"role": "assistant", "content": row["output"]}
]
}
f.write(json.dumps(entry) + "\n")
Multi-turn Conversations
# Input: {"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
messages = []
for turn in row["conversations"]:
role = "user" if turn["from"] == "human" else "assistant"
messages.append({"role": role, "content": turn["value"]})
entry = {"messages": messages}
RL Conversion
For RL, include only the prompt (user message). The model generates responses during training.
from datasets import load_dataset
import json
dataset = load_dataset("your-math-dataset", split="train")
with open("train.jsonl", "w") as f:
for row in dataset:
entry = {
"messages": [
{"role": "user", "content": row["question"]}
],
"rewards": ["accuracy"],
"metadata": {
"expected_answer": row["answer"]
}
}
f.write(json.dumps(entry) + "\n")
GSM8K Math Dataset
from datasets import load_dataset
import json
import re
dataset = load_dataset("gsm8k", "main", split="train")
with open("train.jsonl", "w") as f:
for row in dataset:
# Extract final answer (#### followed by number)
answer_match = re.search(r"####\s*(.+)$", row["answer"])
final_answer = answer_match.group(1).strip() if answer_match else row["answer"]
entry = {
"messages": [{"role": "user", "content": row["question"]}],
"rewards": ["accuracy"],
"metadata": {"expected_answer": final_answer}
}
f.write(json.dumps(entry) + "\n")
MATH Dataset (Competition Math)
from datasets import load_dataset
import json
import re
dataset = load_dataset("hendrycks/competition_math", split="train")
def extract_boxed(text: str) -> str:
match = re.search(r"\\boxed\{([^}]+)\}", text)
return match.group(1) if match else text
with open("train.jsonl", "w") as f:
for row in dataset:
answer = extract_boxed(row["solution"])
# Wrap in $$ for math-verify (skip if already delimited or plain number)
if not answer.startswith(("$", "\\(")) and not answer.replace(".", "").replace("-", "").isdigit():
answer = f"$${answer}$$"
entry = {
"messages": [{"role": "user", "content": row["problem"]}],
"rewards": ["accuracy"],
"metadata": {"expected_answer": answer}
}
f.write(json.dumps(entry) + "\n")
Note: For math-verify, expected_answer MUST have math delimiters ($...$ or \(...\)). Raw LaTeX like \sqrt{2} won't parse - use $\sqrt{2}$. Plain numbers like 42 work as-is.
For reward function examples (math-verify, llm_judge), see the rnow-rewards skill.
Common Configurations
Math Reasoning
project_name: "Math Reasoning" dataset_type: rl data: train_file: train.jsonl batch_size: 8 group_size: 8 model: path: Qwen/Qwen3-8B qlora_rank: 64 algorithm: loss_fn: ppo adv_estimator: grpo kl_penalty_coef: 0.01 rollout: max_turns: 1 max_context_window: 8192 # High for reasoning thinking_mode: medium trainer: num_epochs: 20 learning_rate: 0.0001
Agent with Tools
project_name: "Search Agent" dataset_type: rl data: train_file: train.jsonl batch_size: 4 group_size: 4 model: path: Qwen/Qwen3-8B rollout: max_turns: 5 max_context_window: 2048 termination_policy: last_tool tool_timeout: 30 trainer: num_epochs: 15 learning_rate: 0.0001
Code Execution
project_name: "Code Agent" dataset_type: rl data: train_file: train.jsonl batch_size: 2 group_size: 4 model: path: Qwen/Qwen3-8B rollout: max_turns: 3 max_context_window: 4096 tool_timeout: 120 # Longer for code execution trainer: num_epochs: 10 learning_rate: 0.0001
SFT for Instruction Following
project_name: "Instruction Tuning" dataset_type: sft data: train_file: train.jsonl batch_size: 8 val_split: 0.1 model: path: Qwen/Qwen3-8B qlora_rank: 32 trainer: num_epochs: 3 learning_rate: 0.00005 eval_step: 100
Distillation for Reasoning
project_name: "Distilled Reasoning Model" dataset_type: distill data: train_file: train.jsonl batch_size: 8 group_size: 4 model: path: Qwen/Qwen3-8B # Student qlora_rank: 32 teacher: path: Qwen/Qwen3-32B # Teacher rollout: max_context_window: 8192 # Enough for reasoning trainer: num_epochs: 3 learning_rate: 0.0001 save_step: 20
When to use distillation:
- •Transfer reasoning capabilities from a large model to a smaller one
- •Create a cost-effective model that approximates a larger model's behavior
- •On-policy distillation avoids exposure bias (student learns from its own mistakes)
Validation Rules
- •batch_size * group_size <= 2048
- •qlora_rank <= model's max rank
- •Rewards in train.jsonl must exist in rewards.py (RL only)
- •Tools in train.jsonl must exist in tools.py (RL only)
- •sandbox=True requires docker field (RL only)
- •max_tokens must fit in context window
- •Distillation requires teacher section with valid model path
Testing Configuration
# Validate and test locally rnow test -n 3 --verbose # Test specific entries rnow test --entry 0,1,2 # Override model for testing rnow test --model gpt-5-nano