Docent Ingestion Skill

This skill provides a structured workflow for converting transcripts and evaluation logs into the correct format for ingestion into Docent, an agent analysis tool.

Overview of Docent

Docent is a trace analysis tool that helps researchers analyze and debug agents. Researchers upload a “collection” of traces (“agent runs”) into Docent, where the tool enables them to:

•Engage in structured data analysis such as grouping and joining to understand trends and create charts
•Quickly view traces of interest and capture human annotations of traces through labeling and comments
•Run a semantic search over transcripts by running a user-provided query against each transcript in their collection, then cluster the results to understand high-level patterns
•Draft, refine, and iterate with the user on detailed rubrics to capture fuzzy behaviors like sycophancy, cheating, verbosity, etc.

Docent accelerates researchers by helping them form hypotheses and directing them to read the most relevant transcripts. Researchers use Docent to qualitatively explain and understand shifts in quantitative metrics. Common use cases for Docent include:

•Comparing two checkpoints to understand a regression or a quantitative tradeoff in their benchmark results
•Understanding an unexpected result. For instance, investigating why a checkpoint that receives high reward from a preference model (e.g. for code quality) appears to perform poorly with real users (e.g. PRs frequently rejected for low quality)
•Surfacing previously unknown failure modes. For instance, noticing that the timeout constraint is not explicit in an evaluation, causing thinking models to perform poorly compared to their non-thinking counterparts

When to Offer This Workflow

Trigger conditions: The user mentions phrases like "ingest transcripts into Docent", "upload to Docent", "import runs to Docent", "move data into Docent", or "upload traces to Docent".

Initial offer:

Offer the user a structured workflow for ingesting their transcripts into Docent. Briefly explain the four stages:

•Context gathering: User provides relevant context on their data, including the path to the data, how it was produced, and what kinds of analysis they would like to do
•Planning: Understand how the user’s data is organized, plan an ingestion strategy, and recommend an organization in Docent. Surface the plan for user approval.
1. Examine the overall data hierarchy by mapping the directory and file structure to see whether there are recurring patterns.
2. For individual transcripts, identify all unique formats and create a template for ingesting each one. Map out a schema for each unique transcript format and map each field to the most appropriate class in Docent.
3. Propose an organization structure in Docent (broken down into collections, agent runs, transcript groups, and transcripts) that fits the user’s analysis needs.
•Ingestion: Given the approved organization, plan, write, and test a script that uploads all data from the user-provided directory to Docent.
•Testing: After uploading to a collection in Docent, use the Docent SDK to pull down the collection data and verify that the metadata, transcript formats, and overall organization match expectations.

Explain that you will ingest all the data in a directory of the user’s choosing. Explain why you are asking about their analysis goals: answering is optional, but it helps you structure the data in Docent. Ask whether they want to try this workflow or proceed freeform.

If the user declines, work freeform. If the user accepts, proceed to Stage 1.

Stage 1: Context Gathering

Environment Setup

Before running any Python commands, check for and activate a virtual environment:

if [ -d "venv" ]; then
    source venv/bin/activate
elif [ -d ".venv" ]; then
    source .venv/bin/activate
fi

Gathering Information

Collect the information needed to plan and execute the ingestion. Let the user know they can answer in shorthand or dump context—whichever works best for them. Aim to collect answers to the following:

•API Key: What is your Docent API key? (You can find or create one at: https://docent.transluce.org/settings/api-keys)
•Data Path: What is the path to the files or directory you want to ingest?
•Collection Name: What would you like to name this collection in Docent?
•Data Context: What kind of data is this? (e.g., benchmark evaluation, agent task runs, multi-agent debate, red-teaming attempts) How was this data produced? What does each file represent? (e.g., one task per file, one episode per file, multiple attempts per task)
•Analysis Goals: What kinds of analysis do you want to do in Docent? (e.g., compare two model checkpoints, find failure modes, understand a metric regression)

Store these values for use in subsequent stages. Create ingestion-plan.md in the working directory to log all decisions and findings throughout the workflow. Here is an example structure:

# Docent Ingestion Plan

## Configuration
- Data path: [from user]
- Collection name: [from user]

## File Analysis
[to be filled in Stage 2a]

## Schema
[to be filled in Stage 2b]

## Data Structure Proposal
[to be filled in Stage 2c]

## Field Mapping
[to be filled in Stage 2c]

## Omitted Data
[MUST document any data not ingested and why]

## Execution Log
[to be filled in Stage 3]

## Verification
[to be filled in Stage 4 - compare expected vs actual counts]

Stage 2: Planning

Stage 2a: Understanding File Structure

Build a holistic picture of how the user stores their data and why they organized it that way, then consider what that implies for how the data should be organized in Docent. The strategies below give a quick sense of the data; use whichever apply.

Build Structural Tree

The following script generates a folder-only tree that shows the overall directory structure.

from collections import Counter
from pathlib import Path

def build_folder_tree(path: str, max_depth: int = 5) -> dict | None:
    """Build a nested dict of folders with per-folder file-extension counts."""
    path = Path(path)

    def _recurse(p: Path, depth: int) -> dict | None:
        # Return None for paths that are not directories or lie beyond max_depth
        if depth > max_depth or not p.is_dir():
            return None

        children = {}
        file_extensions = Counter()

        for item in sorted(p.iterdir()):
            if item.is_dir():
                children[item.name] = _recurse(item, depth + 1)
            else:
                file_extensions[item.suffix.lower() or "no_ext"] += 1

        return {
            "children": children,
            "file_counts": dict(file_extensions),
            "total_files": sum(file_extensions.values()),
        }

    return _recurse(path, 0)

Listing individual files in a few key directories also helps reveal the naming conventions and file types present.

Detect Naming Patterns

Examine folder and file names to understand the organizational logic. Sample a few names at different levels of the hierarchy and reason about what they might represent.

Common patterns to look for (as suggestions, not strict rules):

•Dates: ISO format (2024-01-15), compact (20240115), or human-readable (jan_15)
•Model identifiers: Model names, versions, or checkpoints
•Sequential numbering: run_001, sample_42, task_5, episode_100
•Experiment tags: baseline, ablation, v2, control, treatment
•Subdirectory conventions: trajs/, logs/, results/, metadata/, configs/

Rather than pattern-matching mechanically, describe what you observe and form a hypothesis about the user's organizational intent. For example:

•"Folders appear to be organized by date, then by model name"
•"Each subfolder contains a trajs/ directory with JSON files and a config.yaml"
•"File names include what looks like a task ID followed by an attempt number"

Ask the user to confirm your interpretation if uncertain.

Identify Repeatable Templates

Find the structural unit that repeats across the directory (e.g., each experiment folder has the same subdirectory structure):

def find_repeatable_template(tree: dict) -> dict:
    """Find the pattern that repeats across the directory structure."""

    def get_structure_signature(node: dict) -> tuple:
        if node is None:
            return ()
        children = node.get("children", {})
        child_names = tuple(sorted(children.keys()))
        file_exts = tuple(sorted(node.get("file_counts", {}).keys()))
        return (child_names, file_exts)

    signatures = {}
    def collect_signatures(node: dict, path: str = ""):
        if node is None:
            return
        sig = get_structure_signature(node)
        if sig not in signatures:
            signatures[sig] = []
        signatures[sig].append(path)
        for name, child in node.get("children", {}).items():
            collect_signatures(child, f"{path}/{name}")

    collect_signatures(tree)

    repeated = [(sig, paths) for sig, paths in signatures.items()
                if len(paths) > 1 and sig[0]]

    if repeated:
        repeated.sort(key=lambda x: len(x[1]), reverse=True)
        return {
            "template_structure": repeated[0][0],
            "instance_count": len(repeated[0][1]),
            "example_paths": repeated[0][1][:3],
        }
    return {"template_structure": None, "note": "No repeating pattern found"}

Detect Inspect AI Files

Check for Inspect AI .eval files, which have a dedicated loader:

def detect_inspect_files(path: Path) -> list[str]:
    """Detect Inspect .eval files that can use the built-in loader."""
    return [str(f) for f in path.rglob("*.eval")]

If Inspect .eval files are detected, use the built-in loader (see Stage 3).

Decision Point

Based on the structural analysis, determine next steps:

Structure Pattern	Action
Clear repeating template with trajs/logs subdirs	Proceed to schema inference on representative samples
Flat directory with consistent file types	Sample files directly for schema
Mixed/unclear structure	Ask user for clarification
Inspect .eval files present	Use built-in Inspect loader
No recognizable data files	Ask user to confirm path

Log the structural analysis to ingestion-plan.md.

Stage 2b: Schema Inference

Sample files strategically based on the template structure identified in Stage 2a.

Strategic Sampling

def sample_files_strategically(path: Path, template_info: dict) -> list[Path]:
    """Sample files from representative locations within the template structure."""
    samples = []

    if template_info.get("example_paths"):
        for instance_path in template_info["example_paths"][:2]:
            instance = path / instance_path.lstrip("/")
            for subdir in ["trajs", "trajectories", "logs", "results", ""]:
                candidate = instance / subdir if subdir else instance
                if candidate.exists():
                    json_files = sorted(candidate.glob("*.json"))[:1]
                    jsonl_files = sorted(candidate.glob("*.jsonl"))[:1]
                    found = json_files + jsonl_files
                    samples.extend(found)
                    # Stop at the first subdirectory that yields files for this
                    # instance (checking `samples` would also count earlier instances)
                    if found:
                        break

    if not samples:
        samples = list(path.rglob("*.json"))[:3] + list(path.rglob("*.jsonl"))[:2]

    return samples[:5]

Infer Schema

def infer_json_schema(data: dict | list, max_depth: int = 5) -> dict:
    """Recursively infer schema from JSON data."""
    if max_depth == 0:
        return {"type": "any", "note": "truncated"}

    if isinstance(data, dict):
        return {
            "type": "object",
            "fields": {
                k: infer_json_schema(v, max_depth - 1)
                for k, v in data.items()
            }
        }
    elif isinstance(data, list):
        if not data:
            return {"type": "array", "items": "unknown"}
        item_schemas = [infer_json_schema(item, max_depth - 1) for item in data[:3]]
        return {"type": "array", "items": item_schemas[0], "sample_count": len(data)}
    else:
        return {"type": type(data).__name__, "example": repr(data)[:100]}
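
Because the array branch of infer_json_schema describes only the first item, fields that appear in some records but not others can go unnoticed. One way to surface them, sketched below under the assumption that records are flat dicts, is to count how often each top-level key appears. The helper name key_presence and the sample records are hypothetical.

```python
from collections import Counter


def key_presence(records: list[dict]) -> dict[str, float]:
    """Return the fraction of records in which each top-level key appears."""
    total = len(records)
    if total == 0:
        return {}
    counts = Counter(key for record in records for key in record)
    return {key: counts[key] / total for key in sorted(counts)}


# Hypothetical records: "reward" is missing from the second one.
records = [
    {"task_id": "a", "messages": [], "reward": 1.0},
    {"task_id": "b", "messages": []},
]
optional = [key for key, fraction in key_presence(records).items() if fraction < 1.0]
print(optional)  # ['reward']
```

Keys with a fraction below 1.0 are worth noting in ingestion-plan.md, since the conversion function has to tolerate their absence.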

Classify Fields

Identify fields that indicate transcript content, scores, and metadata:

TRANSCRIPT_INDICATORS = ["messages", "conversation", "transcript", "dialogue", "turns", "traj", "trajectory"]
SCORE_INDICATORS = ["score", "reward", "accuracy", "correct", "success", "metric", "result"]
ID_INDICATORS = ["id", "task_id", "sample_id", "episode", "run_id", "uuid"]

def classify_fields(schema: dict) -> dict:
    """Classify fields by their likely purpose."""
    classified = {"transcript": [], "scores": [], "identifiers": [], "metadata": []}

    def check_field(name: str, field_schema: dict, path: str = ""):
        full_path = f"{path}.{name}" if path else name
        name_lower = name.lower()

        if any(ind in name_lower for ind in TRANSCRIPT_INDICATORS):
            classified["transcript"].append(full_path)
        elif any(ind in name_lower for ind in SCORE_INDICATORS):
            classified["scores"].append(full_path)
        elif any(ind in name_lower for ind in ID_INDICATORS):
            classified["identifiers"].append(full_path)
        else:
            classified["metadata"].append(full_path)

        if field_schema.get("type") == "object":
            for sub_name, sub_schema in field_schema.get("fields", {}).items():
                check_field(sub_name, sub_schema, full_path)

    for name, field_schema in schema.get("fields", {}).items():
        check_field(name, field_schema)

    return classified

Log schema and field classification to ingestion-plan.md.
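
The substring checks in classify_fields are a heuristic and can misfire: "id" is a substring of "video_url" and "provider", so both would be labeled identifiers. A possible refinement, shown here as a sketch rather than a replacement, compares indicators against whole name tokens. The token sets and helper names are illustrative assumptions and would need tuning to the user's data.

```python
import re

# Illustrative token sets; plural and variant forms must be listed explicitly.
SCORE_TOKENS = {"score", "scores", "reward", "rewards", "accuracy", "correct", "success", "metric"}
ID_TOKENS = {"id", "uuid", "episode"}


def name_tokens(name: str) -> set[str]:
    """Split snake_case, kebab-case, and camelCase names into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return {token.lower() for token in re.split(r"[^A-Za-z0-9]+", spaced) if token}


def classify_name(name: str) -> str:
    """Label a field name as scores, identifiers, or metadata by whole-token match."""
    tokens = name_tokens(name)
    if tokens & SCORE_TOKENS:
        return "scores"
    if tokens & ID_TOKENS:
        return "identifiers"
    return "metadata"


for field in ["task_id", "taskId", "video_url", "final_reward"]:
    print(field, "->", classify_name(field))
```

Token matching trades recall for precision, so treat either classification as a starting point and confirm ambiguous fields with the user.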

Stage 2c: Docent Organization Proposal

Propose how to organize the data in Docent based on the user's analysis goals and data structure.

Docent Hierarchy Best Practices

Critical: Most Docent analysis features (rubrics, search, clustering) operate at the AgentRun level. Structure data accordingly:

Level	Purpose	When to Use
Collection	One experiment, benchmark run, or dataset	Usually one per ingestion; multiple if fundamentally different experiments
AgentRun	Primary analysis unit	One per complete unit you want to analyze, compare, or score. Rubrics run here. Search returns these.
TranscriptGroup	Logical groupings within an AgentRun	Multiple attempts (pass@k), phases of a task
Transcript	One agent's conversation history	One per agent in multi-agent setups; otherwise usually one per AgentRun

Default: If unsure, make each independent task/episode/sample its own AgentRun with a single Transcript.

Tree/branching data: Ingest each branch as its own Transcript in its own AgentRun. Use metadata fields to identify how branches relate to each other (e.g., parent_branch_id, branch_depth, root_task_id).

Data Pattern to Docent Mapping

Data Pattern	Collection	AgentRun	TranscriptGroup	Transcript
Simple evals	experiment	sample_id, scores	—	messages
Pass@k	experiment	task_id, best_score	attempt_k	messages per attempt
Tree/branching	experiment	one per branch, with metadata linking branches	—	messages for that branch
Multi-agent	experiment	episode_id, joint_scores	—	one per agent

Field Mapping

Map each source field to a Docent location:

Source Field	Target Location	Target Field	Notes
messages	Transcript.messages	—	Convert via parse_chat_message
reward	AgentRun.metadata	scores.reward
task_id	AgentRun.metadata	task_id
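
A mapping table like the one above can be applied mechanically. The sketch below assumes the mapping is written as source path to target metadata path, with dots marking nesting; build_metadata, get_path, and set_path are hypothetical helpers, not part of the Docent SDK.

```python
from typing import Any


def get_path(record: dict, dotted: str) -> Any:
    """Look up a dotted path such as 'info.task_id'; return None if absent."""
    current: Any = record
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def set_path(target: dict, dotted: str, value: Any) -> None:
    """Write value into nested dicts, creating intermediate levels as needed."""
    *parents, leaf = dotted.split(".")
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value


def build_metadata(record: dict, mapping: dict[str, str]) -> dict:
    """Copy each mapped source field into its target metadata field."""
    metadata: dict = {}
    for source, target in mapping.items():
        value = get_path(record, source)
        if value is not None:
            set_path(metadata, target, value)
    return metadata


# Hypothetical mapping mirroring the reward and task_id rows of the table.
mapping = {"reward": "scores.reward", "task_id": "task_id"}
print(build_metadata({"reward": 0.5, "task_id": "t1", "messages": []}, mapping))
# {'scores': {'reward': 0.5}, 'task_id': 't1'}
```

Keeping the mapping as data means the same dict can be logged to ingestion-plan.md and reused by the conversion function.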

Document Omitted Data

CRITICAL: If ANY data will not be ingested, document it clearly:

Field/File	Reason for Omission	Impact
debug_logs/	Contains only debug output, not agent transcripts	None
raw_api_responses	Redundant with parsed messages	Low

Never silently skip data.

Present Plan for Review

Present the ingestion plan to the user:

•Directory structure discovered
•Data type detected
•Proposed Docent hierarchy
•Key field mappings
•Omitted data (if any)

Wait for confirmation before proceeding to Stage 3.

Stage 3: Ingestion

Environment Setup

Activate virtual environment if present:

if [ -d "venv" ]; then
    source venv/bin/activate
elif [ -d ".venv" ]; then
    source .venv/bin/activate
fi

Handle Inspect AI Files

If Inspect .eval files were detected, use the built-in loader:

from inspect_ai.log import read_eval_log
from docent.loaders.load_inspect import load_inspect_log

eval_log = read_eval_log("path/to/file.eval")
agent_runs = load_inspect_log(eval_log)
print(f"Loaded {len(agent_runs)} runs from Inspect log")

Skip to "Upload to Docent" below.

Custom Data Loading

For non-Inspect data, build the ingestion script incrementally.

Load Data

import os
import json
from pathlib import Path
from docent import Docent
from docent.data_models import AgentRun, Transcript, TranscriptGroup
from docent.data_models.chat import parse_chat_message, ToolCall

def load_data(path: str) -> list[dict]:
    """Load data based on structure identified in Stage 2a."""
    path = Path(path)
    records = []
    # Implementation based on detected template structure
    return records

raw_data = load_data(data_path)
print(f"Loaded {len(raw_data)} records")
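
The body of load_data depends on the structure found in Stage 2a. As one possible starting point, the sketch below assumes every .json file holds a single record or a list of records and every .jsonl file holds one record per line. The function name load_records and the _source_file key are hypothetical choices, not Docent conventions.

```python
import json
from pathlib import Path


def load_records(root: str) -> tuple[list[dict], list[str]]:
    """Load dict records from every .json and .jsonl file under root."""
    records: list[dict] = []
    skipped: list[str] = []
    for file in sorted(Path(root).rglob("*")):
        if not file.is_file():
            continue
        if file.suffix == ".jsonl":
            with file.open(encoding="utf-8") as handle:
                items = [json.loads(line) for line in handle if line.strip()]
        elif file.suffix == ".json":
            loaded = json.loads(file.read_text(encoding="utf-8"))
            items = loaded if isinstance(loaded, list) else [loaded]
        else:
            continue
        for index, item in enumerate(items):
            if isinstance(item, dict):
                # Keep provenance so each run can be traced back to its file
                records.append({**item, "_source_file": str(file)})
            else:
                # Record anything unexpected instead of dropping it silently
                skipped.append(f"{file}[{index}]: got {type(item).__name__}")
    return records, skipped
```

Returning the skipped entries alongside the records makes it straightforward to list them under "Omitted Data" in ingestion-plan.md.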

Conversion Function

def convert_to_agent_run(record: dict) -> AgentRun:
    """Convert a single record to AgentRun."""
    raw_messages = record.get("messages") or record.get("traj") or []
    messages = [parse_chat_message(m) for m in raw_messages]

    # Handle tool calls if present
    for i, msg in enumerate(raw_messages):
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            messages[i].tool_calls = [
                ToolCall(
                    id=tc.get("id", f"call_{i}"),
                    function=tc.get("function", {}).get("name", tc.get("name", "")),
                    arguments=tc.get("function", {}).get("arguments", tc.get("arguments", {})),
                    type="function"
                )
                for tc in msg["tool_calls"]
            ]

    transcript = Transcript(
        messages=messages,
        metadata={...}  # transcript-level metadata from mapping
    )

    return AgentRun(
        transcripts=[transcript],
        metadata={
            "scores": {...},  # from mapping
            # other metadata from mapping
        }
    )

Validation Loop

Test conversion on a sample before full ingestion:

errors = []
for i, record in enumerate(raw_data[:10]):
    try:
        agent_run = convert_to_agent_run(record)
        _ = agent_run.text  # Validate by rendering
        print(f"✓ Record {i} converted successfully")
    except Exception as e:
        errors.append((i, str(e)))
        print(f"✗ Record {i} failed: {e}")

if errors:
    print(f"\n{len(errors)} validation errors in first 10 records")

Full Conversion

agent_runs = []
conversion_errors = []

for i, record in enumerate(raw_data):
    try:
        agent_runs.append(convert_to_agent_run(record))
    except Exception as e:
        conversion_errors.append({"index": i, "error": str(e)})

print(f"Converted {len(agent_runs)}/{len(raw_data)} records")
if conversion_errors:
    print(f"Errors ({len(conversion_errors)}): {conversion_errors[:5]}...")

Upload to Docent

client = Docent(api_key=DOCENT_API_KEY)

collection_id = client.create_collection(
    name=collection_name,
    description="",
)
print(f"Created collection: {collection_id}")

client.add_agent_runs(collection_id, agent_runs)
print(f"Uploaded {len(agent_runs)} runs")

print(f"View at: https://docent.transluce.org/collection/{collection_id}")
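
The upload snippet uses DOCENT_API_KEY without defining it. One option that keeps the key out of the script and out of ingestion-plan.md is to read it from an environment variable. This is a sketch: the variable name and the helper read_api_key are assumptions made here, not a documented SDK convention.

```python
import os


def read_api_key(env_var: str = "DOCENT_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running the upload script")
    return key
```

The result can then be bound to the name the upload snippet expects, for example DOCENT_API_KEY = read_api_key().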

Stage 4: Testing & Verification

Verify that the upload succeeded and counts match expectations.

Count Verification

expected_runs = len(agent_runs)
failed_conversions = len(conversion_errors)
total_source_records = len(raw_data)

print(f"\n{'='*50}")
print("VERIFICATION REPORT")
print(f"{'='*50}")
print(f"Source records found:     {total_source_records}")
print(f"Successfully converted:   {expected_runs}")
print(f"Failed to convert:        {failed_conversions}")

# Verify upload via Docent SDK
try:
    collection_info = client.get_collection(collection_id)
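    uploaded_count = collection_info.get("agent_run_count")
    print(f"Uploaded to Docent:       {uploaded_count if uploaded_count is not None else 'unknown'}")

    if uploaded_count is None:
        # A missing count is not a mismatch; it just cannot be checked here
        print("⚠️  Upload count unavailable; please verify manually in Docent")
    elif uploaded_count != expected_runs:
        print(f"⚠️  WARNING: Count mismatch! Expected {expected_runs}, got {uploaded_count}")
    else:
        print("✓ Counts match!")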
except Exception as e:
    print(f"Could not verify upload count via API: {e}")
    print(f"Please verify manually at: https://docent.transluce.org/collection/{collection_id}")

Log Verification Results

Update ingestion-plan.md:

## Verification

### Counts
- Source records: [total_source_records]
- Converted successfully: [expected_runs]
- Conversion failures: [failed_conversions]
- Uploaded to Docent: [uploaded_count]
- **Status:** [MATCH / MISMATCH]

### Errors (if any)
[List conversion errors with record index and error message]

### Collection URL
https://docent.transluce.org/collection/[collection_id]

Reference

See references/docent-data-models.md for complete Docent data model documentation.

For additional guidance on Docent data models and API usage, consult the official documentation: https://docs.transluce.org/llms.txt

Common Patterns

Inspect AI Logs

When .eval files are detected, use the built-in loader:

from inspect_ai.log import read_eval_log
from docent.loaders.load_inspect import load_inspect_log

eval_log = read_eval_log("path/to/file.eval")
agent_runs = load_inspect_log(eval_log)

Parsing Chat Messages

Use parse_chat_message to convert dictionaries to proper message objects:

from docent.data_models.chat import parse_chat_message

# From dict - automatically determines message type from "role"
msg = parse_chat_message({
    "role": "user",
    "content": "What's 2+2?"
})

msg = parse_chat_message({
    "role": "assistant",
    "content": "The answer is 4."
})

msg = parse_chat_message({
    "role": "system",
    "content": "You are a helpful assistant."
})

# Direct construction is also available
from docent.data_models.chat import UserMessage, AssistantMessage, SystemMessage
msg = UserMessage(content="Hello")
msg = AssistantMessage(content="Hi!", model="gpt-4")

Simple Dict to AgentRun

A common pattern for converting flat records:

from docent.data_models import AgentRun, Transcript
from docent.data_models.chat import parse_chat_message

def convert_simple(record: dict) -> AgentRun:
    messages = [parse_chat_message(m) for m in record["messages"]]
    return AgentRun(
        transcripts=[Transcript(messages=messages)],
        metadata={
            # Copy the remaining fields, leaving reward to live only under "scores"
            **{k: v for k, v in record.items() if k not in ("messages", "reward")},
            "scores": {"reward": record.get("reward", 0)},
        }
    )

Tool Calls

Handle assistant messages with tool calls and their responses:

from docent.data_models.chat import AssistantMessage, ToolMessage, ToolCall

# Assistant making a tool call
assistant_msg = AssistantMessage(
    content="Let me search for that.",
    tool_calls=[
        ToolCall(
            id="call_123",
            function="web_search",
            arguments={"query": "weather today"},
            type="function"
        )
    ]
)

# Tool response
tool_msg = ToolMessage(
    content="Sunny, 72°F",
    tool_call_id="call_123",
    function="web_search"
)

# Helper to parse tool calls from raw data
def parse_tool_calls(raw_calls: list) -> list[ToolCall]:
    return [
        ToolCall(
            id=tc["id"],
            function=tc["function"]["name"],
            arguments=tc["function"].get("arguments", {}),
            type="function"
        )
        for tc in raw_calls
    ]
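
Some logging formats, such as OpenAI-style chat completions, store tool-call arguments as a JSON-encoded string rather than a dict, and the helper above passes through whatever it finds. A minimal sketch of a normalizer that could be applied before constructing ToolCall follows. The function name and the _raw fallback key are hypothetical.

```python
import json
from typing import Any


def normalize_arguments(raw: Any) -> dict[str, Any]:
    """Return tool-call arguments as a dict, decoding JSON strings if needed."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str) and raw.strip():
        try:
            decoded = json.loads(raw)
        except json.JSONDecodeError:
            # Keep the undecodable text rather than dropping it
            return {"_raw": raw}
        return decoded if isinstance(decoded, dict) else {"_raw": raw}
    return {}


print(normalize_arguments('{"query": "weather today"}'))  # {'query': 'weather today'}
```

Preserving malformed arguments under a separate key keeps the original text visible in Docent instead of losing it during conversion.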

Pass@k Evaluation

Use TranscriptGroup for attempts:

from uuid import uuid4
from docent.data_models import AgentRun, Transcript, TranscriptGroup
from docent.data_models.chat import parse_chat_message

def convert_pass_at_k(task_data: dict) -> AgentRun:
    agent_run_id = str(uuid4())
    groups = []
    transcripts = []

    for k, attempt in enumerate(task_data["attempts"]):
        group = TranscriptGroup(
            name=f"Attempt {k+1}",
            agent_run_id=agent_run_id,
            metadata={"k": k}
        )
        groups.append(group)

        transcript = Transcript(
            messages=[parse_chat_message(m) for m in attempt["messages"]],
            transcript_group_id=group.id,
            metadata={"attempt": k}
        )
        transcripts.append(transcript)

    return AgentRun(
        id=agent_run_id,
        transcripts=transcripts,
        transcript_groups=groups,
        metadata={"task_id": task_data["task_id"]}
    )
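
The mapping table earlier lists best_score at the AgentRun level for pass@k data, which the function above does not compute. A sketch of deriving it is below, assuming each attempt dict carries a numeric value under a "score" key. Both the key name and the helper summarize_attempts are assumptions about the source data.

```python
def summarize_attempts(attempts: list[dict], score_key: str = "score") -> dict:
    """Compute run-level summary fields from per-attempt scores."""
    scores = [a[score_key] for a in attempts if a.get(score_key) is not None]
    return {
        "num_attempts": len(attempts),
        "num_scored": len(scores),
        # None signals that no attempt had a score, rather than inventing one
        "best_score": max(scores) if scores else None,
    }


print(summarize_attempts([{"score": 0.2}, {"score": 0.9}, {}]))
# {'num_attempts': 3, 'num_scored': 2, 'best_score': 0.9}
```

The returned dict can be merged into the AgentRun metadata next to task_id.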

Tree/Branching

Ingest each branch as its own Transcript in its own AgentRun. Use metadata to link branches:

AgentRun(
    transcripts=[transcript],
    metadata={
        "root_task_id": "task_123",
        "branch_id": "branch_a_1",
        "parent_branch_id": "branch_a",
        "branch_depth": 2,
    }
)

Multi-Agent

One Transcript per agent in the same AgentRun:

AgentRun(
    transcripts=[
        Transcript(messages=agent_1_messages, metadata={"agent_id": "agent_1"}),
        Transcript(messages=agent_2_messages, metadata={"agent_id": "agent_2"}),
    ],
    metadata={
        "episode_id": "episode_42",
        "scores": {"joint_reward": 0.85}
    }
)
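
When a multi-agent episode is logged as a single interleaved list, the per-agent message lists have to be separated first. The sketch below assumes each raw message dict is tagged with the agent whose context it belongs to under an "agent_id" key; that key and the helper split_by_agent are assumptions about the source format.

```python
from collections import defaultdict


def split_by_agent(messages: list[dict], agent_key: str = "agent_id") -> dict[str, list[dict]]:
    """Group an interleaved message log into one ordered list per agent."""
    per_agent: dict[str, list[dict]] = defaultdict(list)
    for message in messages:
        per_agent[message[agent_key]].append(message)
    return dict(per_agent)


# Hypothetical interleaved log from a two-agent episode.
log = [
    {"agent_id": "agent_1", "role": "assistant", "content": "I propose A."},
    {"agent_id": "agent_2", "role": "assistant", "content": "I prefer B."},
    {"agent_id": "agent_1", "role": "assistant", "content": "Let's compromise."},
]
print({agent: len(msgs) for agent, msgs in split_by_agent(log).items()})
# {'agent_1': 2, 'agent_2': 1}
```

Message order within each agent is preserved, so each list can be passed through parse_chat_message and into its own Transcript.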

Validation

Always validate by rendering before upload:

try:
    _ = agent_run.text  # Triggers validation
    print("Valid")
except Exception as e:
    print(f"Invalid: {e}")