Investigate Dataset
This workflow helps you explore and understand datasets used in evaluations. It covers HuggingFace datasets, CSV files, and JSON/JSONL files.
Key Concepts
Dataset Types
- datasets.Dataset (HuggingFace): The raw dataset from the datasets library. Provides Arrow-backed storage with features like .features, .to_pandas(), column access, and streaming. This is what you get from datasets.load_dataset().
- inspect_ai.dataset.Dataset: An abstract base class (ABC) that extends Sequence[Sample]. The concrete implementation is MemoryDataset, which holds a list of Sample objects in memory. Each Sample has fields: input, target, choices, id, metadata, sandbox, files, setup.
Relationship: The inspect_ai.dataset.hf_dataset() function:
- Calls datasets.load_dataset() to get a HuggingFace datasets.Dataset
- Caches it to disk via dataset.save_to_disk() in ~/.cache/inspect_ai/hf_datasets/
- Converts to a list of dicts via dataset.to_list()
- Applies the sample_fields function (either FieldSpec mapping or custom RecordToSample) to each dict
- Returns a MemoryDataset containing the resulting Sample objects
Key difference: datasets.Dataset is Arrow-backed and memory-efficient for large datasets. inspect_ai.dataset.Dataset holds all samples in memory as Python objects.
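The conversion steps above can be sketched in plain Python. This is a minimal illustration, not the Inspect implementation: SketchSample is a hypothetical stand-in for inspect_ai's Sample, and to_samples is a hypothetical helper that shows only the "list of dicts, apply a record-to-sample function, hold the results in memory" part. The records are made-up data.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class SketchSample:
    """Hypothetical stand-in for inspect_ai.dataset.Sample (subset of fields)."""
    input: str
    target: str = ""
    id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def to_samples(
    records: list,
    record_to_sample: Callable[[dict], SketchSample],
) -> list:
    """Apply the conversion to every raw record, keeping all results in memory."""
    return [record_to_sample(r) for r in records]


# Raw records, shaped like the output of dataset.to_list() (made-up data).
records = [
    {"question": "2 + 2?", "answer": "4", "id": "q1"},
    {"question": "Capital of France?", "answer": "Paris", "id": "q2"},
]

samples = to_samples(
    records,
    lambda r: SketchSample(input=r["question"], target=r["answer"], id=r["id"]),
)
print(len(samples), samples[0].input)
```

Because every record becomes a Python object held in a list, memory use grows with dataset size, which is the trade-off against the Arrow-backed datasets.Dataset.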
Common Patterns in Evals
Evals typically define:
- DATASET_PATH: HuggingFace repo path (e.g., "qiaojin/PubMedQA")
- DATASET_REVISION: Optional git revision/tag for reproducibility
- record_to_sample(): Function converting raw records to Sample objects
Prerequisites
- Access to the evaluation code to find dataset configuration
- Python environment with datasets, pandas, and inspect_ai installed
Steps
1. Identify the Dataset Source
Look for these patterns in the evaluation code:
# HuggingFace dataset
DATASET_PATH = "org/dataset-name"
DATASET_REVISION = "v1.0" # optional
hf_dataset(path=DATASET_PATH, name="subset", split="train", ...)
# CSV dataset
csv_dataset("path/to/file.csv", ...)
load_csv_dataset("https://example.com/file.csv", eval_name="myeval", ...)
# JSON/JSONL dataset
json_dataset("path/to/file.json", ...)
load_json_dataset("https://example.com/file.jsonl", eval_name="myeval", ...)
2. Load the Raw Dataset
For investigation, load the raw data directly (not through Inspect's sample_fields transformation).
HuggingFace Datasets
from datasets import load_dataset
# Basic loading
ds = load_dataset("org/dataset-name", split="train")
# With subset/config name
ds = load_dataset("org/dataset-name", "subset_name", split="train")
# With specific revision
ds = load_dataset("org/dataset-name", revision="v1.0", split="train")
# For gated datasets, ensure HF_TOKEN is set or use:
# huggingface-cli login
CSV Files
import pandas as pd
# Direct pandas loading (recommended for investigation)
df = pd.read_csv("path/to/file.csv")
# With encoding/delimiter options
df = pd.read_csv("path/to/file.csv", encoding="utf-8", delimiter=",")
# For remote URLs
df = pd.read_csv("https://example.com/file.csv")
JSON/JSONL Files
import pandas as pd
# JSON file (array of objects)
df = pd.read_json("path/to/file.json")
# JSONL file (one JSON object per line)
df = pd.read_json("path/to/file.jsonl", lines=True)
# For remote URLs
df = pd.read_json("https://example.com/file.jsonl", lines=True)
Note: For investigation, using pandas directly on raw files is simpler than going through Inspect's loaders, since you want to see the raw data structure before any sample_fields transformation.
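When a JSONL file is too large to load whole, or pandas is not at hand, the first few records can be read with the standard library alone. A minimal sketch: peek_jsonl is a hypothetical helper, and the demo file is made-up data written to a temporary directory.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return the first n records of a JSONL file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines, then parse only the first n non-empty ones.
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(lines, n)]


# Demo with a small made-up file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(
        '{"question": "2 + 2?", "answer": "4"}\n'
        '{"question": "Capital of France?", "answer": "Paris"}\n',
        encoding="utf-8",
    )
    first = peek_jsonl(demo, n=1)

print(first)
print(sorted(first[0].keys()))
```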
3. Explore Dataset Structure
For HuggingFace Datasets
# View features (schema)
print(ds.features)
# Output: {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos']), ...}
# Dataset info
print(f"Number of samples: {len(ds)}")
print(f"Column names: {ds.column_names}")
print(f"Dataset info: {ds.info}")
# View first few samples
for i, sample in enumerate(ds):
    if i >= 3:
        break
    print(sample)
# Or use slicing
print(ds[:3])
For CSV/JSON DataFrames
# Basic info
print(df.info())
print(f"Number of rows: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
# View first few rows
print(df.head(10))
# Data types
print(df.dtypes)
4. Convert to Pandas for Analysis
HuggingFace Dataset to DataFrame
import pandas as pd
# Full conversion (for smaller datasets)
df = ds.to_pandas()
# For large datasets, sample first
df = ds.select(range(min(1000, len(ds)))).to_pandas()
# Basic analysis
print(df.info())
print(df.describe())
print(df.head(10))
# Check for missing values
print(df.isnull().sum())
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
# Value counts for categorical columns
for col in df.select_dtypes(include=['object']).columns:
    print(f"\n{col} value counts:")
    print(df[col].value_counts().head(10))
Inspect Dataset to DataFrame
Unlike HuggingFace's datasets.Dataset, inspect_ai.dataset.Dataset doesn't have a .to_pandas() method. Convert manually:
import pandas as pd
from inspect_ai.dataset import csv_dataset, json_dataset
# Load through Inspect
dataset = csv_dataset("path/to/file.csv") # or json_dataset()
# Convert Sample objects to DataFrame
df = pd.DataFrame([
    {
        "id": s.id,
        "input": s.input,
        "target": s.target,
        "choices": s.choices,
        **(s.metadata or {})
    }
    for s in dataset
])
5. Investigate Specific Fields
# For nested fields (common in HF datasets)
# Example: {'answers': {'text': [...], 'answer_start': [...]}}
if 'answers' in ds.features:
    print(ds.features['answers'])
    # Access nested data
    print(ds[0]['answers'])
# For ClassLabel fields
if hasattr(ds.features.get('label', None), 'names'):
    print(f"Label names: {ds.features['label'].names}")
    print(f"Label distribution: {pd.Series(ds['label']).value_counts()}")
# Check field lengths/sizes
if 'text' in ds.column_names:
    lengths = [len(t) for t in ds['text']]
    print(f"Text length - min: {min(lengths)}, max: {max(lengths)}, avg: {sum(lengths)/len(lengths):.0f}")
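Min, max, and mean can hide a long tail, so a median and a high percentile are often more informative when checking for truncation risk. A sketch using only the standard library: length_summary is a hypothetical helper, and its input is any list of strings, such as a text column. The data is made up.

```python
import statistics


def length_summary(texts):
    """Summarize character lengths of a list of strings."""
    lengths = sorted(len(t) for t in texts)
    if not lengths:
        return {}
    summary = {
        "min": lengths[0],
        "median": statistics.median(lengths),
        "max": lengths[-1],
    }
    # quantiles() needs at least two data points; the last of the
    # 19 cut points for n=20 is the 95th percentile.
    if len(lengths) >= 2:
        summary["p95"] = statistics.quantiles(lengths, n=20, method="inclusive")[-1]
    return summary


texts = ["a", "bb", "ccc", "dddd", "e" * 100]  # made-up data
summary = length_summary(texts)
print(summary)
```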
6. Check Data Quality
# Missing/empty values
for col in ds.column_names:
    data = ds[col]
    empty_count = sum(1 for x in data if x is None or x == "" or x == [])
    if empty_count > 0:
        print(f"{col}: {empty_count} empty values ({100*empty_count/len(ds):.1f}%)")
# Duplicate IDs (if ID field exists)
if 'id' in ds.column_names:
    ids = ds['id']
    unique_ids = set(ids)
    if len(ids) != len(unique_ids):
        print(f"Duplicate IDs found: {len(ids) - len(unique_ids)}")
# Check for expected fields
expected_fields = ['input', 'target', 'id']  # adjust as needed
missing = [f for f in expected_fields if f not in ds.column_names]
if missing:
    print(f"Missing expected fields: {missing}")
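The same checks can be wrapped in a reusable function that works on a list of dicts, for example the output of ds.to_list(). A sketch with hypothetical names (quality_report, id_field) and made-up data:

```python
from collections import Counter


def quality_report(records, id_field="id"):
    """Count empty values per column and list IDs that occur more than once."""
    empty = Counter()
    for record in records:
        for key, value in record.items():
            # Treat None, empty strings, and empty lists as "empty".
            if value is None or value == "" or value == []:
                empty[key] += 1
    id_counts = Counter(r[id_field] for r in records if id_field in r)
    duplicates = sorted(k for k, n in id_counts.items() if n > 1)
    return {"empty": dict(empty), "duplicate_ids": duplicates}


records = [
    {"id": "a", "question": "2 + 2?", "answer": "4"},
    {"id": "a", "question": "", "answer": None},
    {"id": "b", "question": "Capital of France?", "answer": "Paris"},
]
report = quality_report(records)
print(report)
```

Returning the duplicated IDs themselves, rather than only a count, makes it easier to look up the offending rows afterwards.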
7. Understand the Sample Conversion
Look at the record_to_sample function to understand how raw data maps to Inspect samples:
# Example pattern
from inspect_ai.dataset import Sample

def record_to_sample(record: dict) -> Sample:
    return Sample(
        input=record["question"],  # What the model sees
        target=record["answer"],  # Expected answer
        id=record.get("id"),  # Unique identifier
        choices=record.get("options", []),  # For multiple choice
        metadata={"source": record["source"]}  # Extra info
    )
Key questions:
- Which fields become input? Are they combined/formatted?
- What is the target format? (letter, text, JSON, etc.)
- Are there choices for multiple choice?
- What goes into metadata?
- Are any records filtered out?
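One way to answer the last question is to run the conversion over the raw records and note which ones fail or are dropped. A sketch, assuming the eval's converter either raises on bad records or returns None or an empty list to skip them (conventions vary between evals). Both dry_run_conversion and the toy record_to_sample below are hypothetical, and the toy converter returns plain dicts rather than Sample objects so the example runs without inspect_ai.

```python
def dry_run_conversion(records, record_to_sample):
    """Run the conversion on each record; collect successes, skips, and errors."""
    converted, skipped, errors = [], [], []
    for index, record in enumerate(records):
        try:
            result = record_to_sample(record)
        except Exception as exc:  # report every failure instead of stopping
            errors.append((index, f"{type(exc).__name__}: {exc}"))
            continue
        if result is None or result == []:
            skipped.append(index)
        else:
            converted.append(result)
    return converted, skipped, errors


def record_to_sample(record):
    """Toy converter: drops records without an answer, requires a question."""
    if not record.get("answer"):
        return None
    return {"input": record["question"], "target": record["answer"]}


records = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Unanswered", "answer": ""},
    {"answer": "orphan answer"},  # missing "question" raises KeyError
]
converted, skipped, errors = dry_run_conversion(records, record_to_sample)
print(len(converted), skipped, errors)
```

Comparing len(converted) with the raw row count shows at a glance whether the eval runs on fewer samples than the dataset contains.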
8. Test the Inspect Dataset Loading
# Load through Inspect to verify conversion works
from inspect_evals.utils.huggingface import hf_dataset
# or: from inspect_ai.dataset import hf_dataset
dataset = hf_dataset(
    path="org/dataset-name",
    revision="v1.0",  # optional
    name="subset",
    split="train",
    sample_fields=record_to_sample,  # from the eval
)
# Check samples
print(f"Loaded {len(dataset)} samples")
for sample in dataset[:3]:
    print(f"ID: {sample.id}")
    print(f"Input: {sample.input[:200]}...")
    print(f"Target: {sample.target}")
    print(f"Choices: {sample.choices}")
    print(f"Metadata: {sample.metadata}")
    print("---")
Caching
HuggingFace Native Cache
- Default location: ~/.cache/huggingface/datasets/
- Control with HF_DATASETS_CACHE environment variable
- Datasets are cached by fingerprint (hash of transforms applied)
Inspect AI HuggingFace Cache
- Location: ~/.cache/inspect_ai/hf_datasets/
- Used by inspect_ai.dataset.hf_dataset()
- Caches the processed datasets.Dataset via save_to_disk() / load_from_disk()
- Cache key: hash of path + name + data_dir + split + kwargs
- Pass cached=False to force re-download, or use the revision parameter (which always revalidates)
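To see why changing any loader argument produces a separate cache entry, the cache key described above can be imitated with a few lines of Python. This is illustrative only: the hash function (MD5) and the way the arguments are serialized are assumptions, not Inspect's actual scheme, and sketch_cache_key is a hypothetical name.

```python
import hashlib


def sketch_cache_key(path, name=None, data_dir=None, split="train", **kwargs):
    """Illustrative only: derive a stable key from the loader arguments."""
    # Sort kwargs so the key does not depend on argument order.
    parts = [path, name, data_dir, split, sorted(kwargs.items())]
    return hashlib.md5(repr(parts).encode("utf-8")).hexdigest()


key_train = sketch_cache_key("org/dataset-name", split="train")
key_test = sketch_cache_key("org/dataset-name", split="test")
print(key_train != key_test)
```

A practical consequence is that a stale entry for one split or subset is not refreshed by loading another, since each combination of arguments is cached independently.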
Inspect Evals Cache
- Location: platformdirs.user_cache_dir("inspect_evals") (typically ~/.cache/inspect_evals/ on Linux, ~/Library/Caches/inspect_evals on macOS)
- Used by load_csv_dataset and load_json_dataset for remote URLs
- Organized by eval name with URL hash filenames
To force re-download:
# HuggingFace native
ds = load_dataset("org/name", download_mode="force_redownload")
# Inspect AI hf_dataset
from inspect_ai.dataset import hf_dataset
ds = hf_dataset("org/name", split="train", cached=False)
# Inspect evals cached loaders
load_json_dataset(url, eval_name="myeval", refresh=True)
Quick Reference Commands
# View HF dataset info without downloading
uv run python -c "from datasets import load_dataset_builder; b = load_dataset_builder('org/name'); print(b.info)"
# List available configs/subsets
uv run python -c "from datasets import get_dataset_config_names; print(get_dataset_config_names('org/name'))"
# List available splits (reads metadata only; pass a config name as the second argument for multi-config datasets)
uv run python -c "from datasets import get_dataset_split_names; print(get_dataset_split_names('org/name'))"
Using Test Utilities
The tests/utils/huggingface.py module provides utilities for querying HuggingFace dataset metadata via the Dataset Viewer API (without downloading the full dataset):
from tests.utils.huggingface import get_dataset_infos_dict, DatasetInfo
# Get dataset info for all configs/subsets
infos = get_dataset_infos_dict("qiaojin/PubMedQA")
# Returns a dict mapping config names to DatasetInfo objects
for config_name, info in infos.items():
    print(f"Config: {config_name}")
    print(f"  Splits: {list(info.splits.keys())}")
    print(f"  Features: {list(info.features.keys())}")
    print(f"  Num rows: {info.splits['train'].num_examples if 'train' in info.splits else 'N/A'}")
The DatasetInfo object contains:
- features: Dict of column names to feature types (schema)
- splits: Dict of split names to SplitInfo (with num_examples, num_bytes)
- dataset_name, config_name: Identifiers
- description, citation, license: Metadata
Troubleshooting
- Gated dataset: Run huggingface-cli login or set HF_TOKEN
- Rate limited: The hf_dataset wrapper in inspect_evals.utils.huggingface has built-in retry with backoff
- Large dataset: Use streaming=True or split="train[:1000]" for sampling
- Missing revision: Check the dataset's "Files and versions" tab on HuggingFace
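For direct load_dataset() calls, or any loader without built-in retries, a retry loop with exponential backoff is a common workaround for rate limits. The sketch below is a generic pattern, not the implementation used in inspect_evals.utils.huggingface; retry_with_backoff and its parameters are hypothetical, and the flaky function simulates a rate-limited call.

```python
import time


def retry_with_backoff(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)


# Demo: a flaky function that succeeds on the third call (no real sleeping).
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

In real use the wrapped callable would be something like lambda: load_dataset("org/dataset-name", split="train"), with the default sleep left in place.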