investigate-dataset

Investigate datasets from HuggingFace, CSV, or JSON files in depth to fully understand their structure, fields, and data quality.

SKILL.md
--- frontmatter
name: investigate-dataset
description: Investigate datasets from HuggingFace, CSV, or JSON files to understand their structure, fields, and data quality

Investigate Dataset

This workflow helps you explore and understand datasets used in evaluations. It covers HuggingFace datasets, CSV files, and JSON/JSONL files.

Key Concepts

Dataset Types

  1. datasets.Dataset (HuggingFace): The raw dataset from the datasets library. Provides Arrow-backed storage with features like .features, .to_pandas(), column access, and streaming. This is what you get from datasets.load_dataset().

  2. inspect_ai.dataset.Dataset: An abstract base class (ABC) that extends Sequence[Sample]. The concrete implementation is MemoryDataset, which holds a list of Sample objects in memory. Each Sample has fields: input, target, choices, id, metadata, sandbox, files, setup.

Relationship: The inspect_ai.dataset.hf_dataset() function:

  1. Calls datasets.load_dataset() to get a HuggingFace datasets.Dataset
  2. Caches it to disk via dataset.save_to_disk() in ~/.cache/inspect_ai/hf_datasets/
  3. Converts to a list of dicts via dataset.to_list()
  4. Applies the sample_fields function (either FieldSpec mapping or custom RecordToSample) to each dict
  5. Returns a MemoryDataset containing the resulting Sample objects

Key difference: datasets.Dataset is Arrow-backed and memory-efficient for large datasets. inspect_ai.dataset.Dataset holds all samples in memory as Python objects.
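
Steps 3 to 5 of the conversion can be sketched in plain Python. This is an illustration only, not Inspect's implementation: SketchSample is a hypothetical stand-in for inspect_ai.dataset.Sample, convert_records and to_sample are made-up helper names, and the records are invented.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class SketchSample:
    """Hypothetical stand-in for inspect_ai.dataset.Sample (illustration only)."""
    input: str
    target: str = ""
    id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def convert_records(
    records: List[Dict[str, Any]],
    record_to_sample: Callable[[Dict[str, Any]], SketchSample],
) -> List[SketchSample]:
    """Apply the per-record conversion to every dict and keep all results in memory."""
    return [record_to_sample(record) for record in records]


def to_sample(record: Dict[str, Any]) -> SketchSample:
    """Map raw field names onto sample fields."""
    return SketchSample(
        input=record["question"],
        target=record["answer"],
        id=record.get("id"),
        metadata={"source": record.get("source")},
    )


# Invented records standing in for the output of dataset.to_list()
raw_records = [
    {"id": "q1", "question": "2 + 2?", "answer": "4", "source": "demo"},
    {"id": "q2", "question": "Capital of France?", "answer": "Paris", "source": "demo"},
]
samples = convert_records(raw_records, to_sample)
print(len(samples), samples[0].input)
```

Because every converted sample lives in a Python list, memory use grows with the number of records, which is the trade-off described above.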

Common Patterns in Evals

Evals typically define:

  • DATASET_PATH: HuggingFace repo path (e.g., "qiaojin/PubMedQA")
  • DATASET_REVISION: Optional git revision/tag for reproducibility
  • record_to_sample(): Function converting raw records to Sample objects

Prerequisites

  • Access to the evaluation code to find dataset configuration
  • Python environment with datasets, pandas, and inspect_ai installed

Steps

1. Identify the Dataset Source

Look for these patterns in the evaluation code:

python
# HuggingFace dataset
DATASET_PATH = "org/dataset-name"
DATASET_REVISION = "v1.0"  # optional
hf_dataset(path=DATASET_PATH, name="subset", split="train", ...)

# CSV dataset
csv_dataset("path/to/file.csv", ...)
load_csv_dataset("https://example.com/file.csv", eval_name="myeval", ...)

# JSON/JSONL dataset
json_dataset("path/to/file.json", ...)
load_json_dataset("https://example.com/file.jsonl", eval_name="myeval", ...)
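
When an eval spans several files, a quick text scan can locate these patterns. The following is a minimal sketch using only the standard library; find_dataset_references is a hypothetical helper, and the loader names are simply the ones listed above.

```python
import re

# Loader names taken from the patterns above; extend the tuple for other helpers.
LOADER_NAMES = (
    "hf_dataset",
    "csv_dataset",
    "json_dataset",
    "load_csv_dataset",
    "load_json_dataset",
)
LOADER_CALL = re.compile(r"\b(" + "|".join(LOADER_NAMES) + r")\s*\(")
CONSTANT = re.compile(r"^\s*(DATASET_PATH|DATASET_REVISION)\s*=\s*(.+)$")


def find_dataset_references(source):
    """Return (line_number, kind, text) tuples for lines that mention a dataset source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        constant = CONSTANT.match(line)
        if constant:
            hits.append((lineno, constant.group(1), line.strip()))
            continue
        call = LOADER_CALL.search(line)
        if call:
            hits.append((lineno, call.group(1), line.strip()))
    return hits


# Invented source text; in practice read it with pathlib.Path(...).read_text()
example_source = '''
DATASET_PATH = "org/dataset-name"
DATASET_REVISION = "v1.0"

def my_task():
    return hf_dataset(path=DATASET_PATH, split="train")
'''
for hit in find_dataset_references(example_source):
    print(hit)
```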

2. Load the Raw Dataset

For investigation, load the raw data directly (not through Inspect's sample_fields transformation).

HuggingFace Datasets

python
from datasets import load_dataset

# Basic loading
ds = load_dataset("org/dataset-name", split="train")

# With subset/config name
ds = load_dataset("org/dataset-name", "subset_name", split="train")

# With specific revision
ds = load_dataset("org/dataset-name", revision="v1.0", split="train")

# For gated datasets, ensure HF_TOKEN is set or use:
# huggingface-cli login

CSV Files

python
import pandas as pd

# Direct pandas loading (recommended for investigation)
df = pd.read_csv("path/to/file.csv")

# With encoding/delimiter options
df = pd.read_csv("path/to/file.csv", encoding="utf-8", delimiter=",")

# For remote URLs
df = pd.read_csv("https://example.com/file.csv")

JSON/JSONL Files

python
import pandas as pd

# JSON file (array of objects)
df = pd.read_json("path/to/file.json")

# JSONL file (one JSON object per line)
df = pd.read_json("path/to/file.jsonl", lines=True)

# For remote URLs
df = pd.read_json("https://example.com/file.jsonl", lines=True)

Note: For investigation, using pandas directly on raw files is simpler than going through Inspect's loaders, since you want to see the raw data structure before any sample_fields transformation.
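
For very large JSONL files it can help to look at a few records before loading everything into pandas. A minimal standard-library sketch, where peek_jsonl is a hypothetical helper and the demo data is invented:

```python
import io
import json
from itertools import islice


def peek_jsonl(lines, n=3):
    """Parse at most n non-blank lines from an iterable of JSONL text lines."""
    non_blank = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_blank, n)]


# In practice, pass an open file handle:
#   with open("path/to/file.jsonl", encoding="utf-8") as f:
#       records = peek_jsonl(f)
demo = io.StringIO(
    '{"question": "2 + 2?", "answer": "4"}\n'
    "\n"
    '{"question": "Capital of France?", "answer": "Paris"}\n'
    '{"question": "x", "answer": "y"}\n'
)
records = peek_jsonl(demo, n=2)
print(records)
print(sorted(records[0].keys()))
```

Only the first n non-blank lines are read, so the rest of the file is never parsed.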

3. Explore Dataset Structure

For HuggingFace Datasets

python
# View features (schema)
print(ds.features)
# Output: {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos']), ...}

# Dataset info
print(f"Number of samples: {len(ds)}")
print(f"Column names: {ds.column_names}")
print(f"Dataset info: {ds.info}")

# View first few samples
for i, sample in enumerate(ds):
    if i >= 3:
        break
    print(sample)

# Or use slicing (returns a dict mapping each column name to a list of values, not a list of rows)
print(ds[:3])

For CSV/JSON DataFrames

python
# Basic info
print(df.info())
print(f"Number of rows: {len(df)}")
print(f"Columns: {df.columns.tolist()}")

# View first few rows
print(df.head(10))

# Data types
print(df.dtypes)

4. Convert to Pandas for Analysis

HuggingFace Dataset to DataFrame

python
import pandas as pd

# Full conversion (for smaller datasets)
df = ds.to_pandas()

# For large datasets, sample first
df = ds.select(range(min(1000, len(ds)))).to_pandas()

# Basic analysis
print(df.info())
print(df.describe())
print(df.head(10))

# Check for missing values
print(df.isnull().sum())

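# Check for duplicates (raises TypeError if a column holds lists or arrays;
# in that case pass subset=[...] with hashable columns only)
print(f"Duplicate rows: {df.duplicated().sum()}")

# Value counts for object (string-like) columns; cast to str so that
# unhashable values such as lists or dicts do not raise TypeError
for col in df.select_dtypes(include=['object']).columns:
    print(f"\n{col} value counts:")
    print(df[col].astype(str).value_counts().head(10))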

Inspect Dataset to DataFrame

Unlike HuggingFace's datasets.Dataset, inspect_ai.dataset.Dataset doesn't have a .to_pandas() method. Convert manually:

python
import pandas as pd
from inspect_ai.dataset import csv_dataset, json_dataset

# Load through Inspect
dataset = csv_dataset("path/to/file.csv")  # or json_dataset()

# Convert Sample objects to DataFrame
df = pd.DataFrame([
    {
        "id": s.id,
        "input": s.input,
        "target": s.target,
        "choices": s.choices,
        **(s.metadata or {})
    }
    for s in dataset
])

5. Investigate Specific Fields

python
import pandas as pd

# For nested fields (common in HF datasets)
# Example: {'answers': {'text': [...], 'answer_start': [...]}}
if 'answers' in ds.features:
    print(ds.features['answers'])
    # Access nested data
    print(ds[0]['answers'])

# For ClassLabel fields (only ClassLabel features expose .names)
if hasattr(ds.features.get('label'), 'names'):
    print(f"Label names: {ds.features['label'].names}")
    print(f"Label distribution: {pd.Series(list(ds['label'])).value_counts()}")

# Check field lengths/sizes (None values are skipped because len(None) raises TypeError)
if 'text' in ds.column_names:
    lengths = [len(t) for t in ds['text'] if t is not None]
    if lengths:
        print(f"Text length - min: {min(lengths)}, max: {max(lengths)}, avg: {sum(lengths)/len(lengths):.0f}")
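
When records are deeply nested or a field mixes types, a per-path type count gives a quick overview. A minimal sketch over plain dicts (for example the output of ds.to_list()); field_type_summary is a hypothetical helper and the records are invented:

```python
from collections import Counter, defaultdict


def field_type_summary(records):
    """Count the Python types seen at each dotted field path across records."""
    summary = defaultdict(Counter)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested dicts, extending the dotted path
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            summary[path][type(value).__name__] += 1

    for record in records:
        walk(record, "")
    return {path: dict(counts) for path, counts in summary.items()}


records = [
    {"id": 1, "answers": {"text": ["Paris"], "answer_start": [0]}},
    {"id": "2", "answers": {"text": [], "answer_start": []}},
    {"id": 3, "answers": None},
]
summary = field_type_summary(records)
for path, counts in sorted(summary.items()):
    print(path, counts)
```

A path with more than one type, such as id holding both int and str here, is usually worth a closer look before writing the conversion function.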

6. Check Data Quality

python
# Missing/empty values
for col in ds.column_names:
    data = ds[col]
    empty_count = sum(1 for x in data if x is None or x == "" or x == [])
    if empty_count > 0:
        print(f"{col}: {empty_count} empty values ({100*empty_count/len(ds):.1f}%)")

# Duplicate IDs (if ID field exists)
if 'id' in ds.column_names:
    ids = ds['id']
    unique_ids = set(ids)
    if len(ids) != len(unique_ids):
        print(f"Duplicate IDs found: {len(ids) - len(unique_ids)}")

# Check for expected fields
expected_fields = ['input', 'target', 'id']  # adjust as needed
missing = [f for f in expected_fields if f not in ds.column_names]
if missing:
    print(f"Missing expected fields: {missing}")
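
Exact duplicate checks miss records that differ only in case or whitespace. The sketch below groups such near-duplicates; normalize and find_near_duplicates are hypothetical helpers, and the normalization rule (lowercase plus collapsed whitespace) is an assumption to adjust for your data.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())


def find_near_duplicates(records, field):
    """Group record indices whose normalized `field` text is identical."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        value = record.get(field)
        if isinstance(value, str):  # skip missing or non-text values
            groups[normalize(value)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


records = [
    {"question": "What is 2 + 2?"},
    {"question": "what is  2 + 2? "},
    {"question": "Capital of France?"},
    {"question": None},
]
print(find_near_duplicates(records, "question"))
```

For a HuggingFace dataset, the same function can be applied to ds.to_list().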

7. Understand the Sample Conversion

Look at the record_to_sample function to understand how raw data maps to Inspect samples:

python
from inspect_ai.dataset import Sample

# Example pattern
def record_to_sample(record: dict) -> Sample:
    return Sample(
        input=record["question"],           # What the model sees
        target=record["answer"],            # Expected answer
        id=record.get("id"),                # Unique identifier
        choices=record.get("options", []),  # For multiple choice
        metadata={"source": record["source"]}  # Extra info
    )

Key questions:

  • Which fields become input? Are they combined/formatted?
  • What is the target format? (letter, text, JSON, etc.)
  • Are there choices for multiple choice?
  • What goes into metadata?
  • Are any records filtered out?
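
To answer the last question empirically, you can run the conversion over raw records and count what comes back. This sketch assumes, purely for illustration, that a conversion signals a skipped record by returning None or an empty list; check how the eval you are investigating actually filters. audit_conversion and convert are hypothetical names.

```python
def audit_conversion(records, convert):
    """Run `convert` on each record and report kept/dropped counts and errors."""
    kept, dropped, errors = 0, [], []
    for index, record in enumerate(records):
        try:
            result = convert(record)
        except (KeyError, TypeError, ValueError) as exc:
            # A record the conversion cannot handle at all
            errors.append((index, repr(exc)))
            continue
        if result is None or result == []:
            dropped.append(index)
        else:
            kept += 1
    return {"kept": kept, "dropped": dropped, "errors": errors}


def convert(record):
    # Hypothetical conversion: skip records without an answer.
    if not record["answer"]:
        return None
    return {"input": record["question"], "target": record["answer"]}


raw = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Unanswered", "answer": ""},
    {"question": "No answer key"},
]
report = audit_conversion(raw, convert)
print(report)
```

Comparing the kept count with the raw record count shows how many records the conversion removes, and the errors list points at records with missing fields.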

8. Test the Inspect Dataset Loading

python
# Load through Inspect to verify conversion works
from inspect_evals.utils.huggingface import hf_dataset
# or: from inspect_ai.dataset import hf_dataset

dataset = hf_dataset(
    path="org/dataset-name",
    revision="v1.0",  # optional
    name="subset",
    split="train",
    sample_fields=record_to_sample,  # from the eval
)

# Check samples
print(f"Loaded {len(dataset)} samples")
for sample in dataset[:3]:
    print(f"ID: {sample.id}")
    print(f"Input: {str(sample.input)[:200]}...")  # input may be a str or a list of chat messages
    print(f"Target: {sample.target}")
    print(f"Choices: {sample.choices}")
    print(f"Metadata: {sample.metadata}")
    print("---")

Caching

HuggingFace Native Cache

  • Default location: ~/.cache/huggingface/datasets/
  • Control with HF_DATASETS_CACHE environment variable
  • Datasets are cached by fingerprint (hash of transforms applied)

Inspect AI HuggingFace Cache

  • Location: ~/.cache/inspect_ai/hf_datasets/
  • Used by inspect_ai.dataset.hf_dataset()
  • Caches the processed datasets.Dataset via save_to_disk() / load_from_disk()
  • Cache key: hash of path + name + data_dir + split + kwargs
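  • Pass cached=False to skip this cache and reload through datasets.load_dataset() (the HuggingFace native cache may still serve the files). Setting the revision parameter also bypasses this cache on every call.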
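
The effect of a parameter-based cache key can be illustrated with a small sketch. This is not Inspect's code: sketch_cache_key is a hypothetical function, and SHA-256 is used only as a stand-in for whatever hash the library applies.

```python
import hashlib


def sketch_cache_key(path, name=None, data_dir=None, split=None, **kwargs):
    """Illustrative cache key: a digest of the loading parameters (not Inspect's actual hash)."""
    raw = f"{path}{name}{data_dir}{split}{sorted(kwargs.items())}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


train_key = sketch_cache_key("org/dataset-name", name="subset", split="train")
test_key = sketch_cache_key("org/dataset-name", name="subset", split="test")
same_key = sketch_cache_key("org/dataset-name", name="subset", split="train")

# Different splits map to different cache entries; identical parameters reuse one entry.
print(train_key != test_key, train_key == same_key)
```

Any change to one of the listed parameters produces a new cache entry, while repeated calls with the same parameters reuse the existing one.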

Inspect Evals Cache

  • Location: platformdirs.user_cache_dir("inspect_evals") (typically ~/.cache/inspect_evals/ on Linux, ~/Library/Caches/inspect_evals on macOS)
  • Used by load_csv_dataset and load_json_dataset for remote URLs
  • Organized by eval name with URL hash filenames

To force re-download:

python
# HuggingFace native
ds = load_dataset("org/name", download_mode="force_redownload")

# Inspect AI hf_dataset
from inspect_ai.dataset import hf_dataset
ds = hf_dataset("org/name", split="train", cached=False)

# Inspect evals cached loaders
load_json_dataset(url, eval_name="myeval", refresh=True)

Quick Reference Commands

bash
# View HF dataset info without downloading
uv run python -c "from datasets import load_dataset_builder; b = load_dataset_builder('org/name'); print(b.info)"

# List available configs/subsets
uv run python -c "from datasets import get_dataset_config_names; print(get_dataset_config_names('org/name'))"

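# List available splits without downloading the data (for multi-config datasets, pass the config name as a second argument)
uv run python -c "from datasets import get_dataset_split_names; print(get_dataset_split_names('org/name'))"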

Using Test Utilities

The tests/utils/huggingface.py module provides utilities for querying HuggingFace dataset metadata via the Dataset Viewer API (without downloading the full dataset):

python
from tests.utils.huggingface import get_dataset_infos_dict, DatasetInfo

# Get dataset info for all configs/subsets
infos = get_dataset_infos_dict("qiaojin/PubMedQA")

# Returns a dict mapping config names to DatasetInfo objects
for config_name, info in infos.items():
    print(f"Config: {config_name}")
    print(f"  Splits: {list(info.splits.keys())}")
    print(f"  Features: {list(info.features.keys())}")
    print(f"  Num rows: {info.splits['train'].num_examples if 'train' in info.splits else 'N/A'}")

The DatasetInfo object contains:

  • features: Dict of column names to feature types (schema)
  • splits: Dict of split names to SplitInfo (with num_examples, num_bytes)
  • dataset_name, config_name: Identifiers
  • description, citation, license: Metadata
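
For orientation, the kind of response such a utility works with can be sketched as follows. The endpoint URL and the payload shape (a dataset_info mapping keyed by config name) are assumptions about the Dataset Viewer API, not taken from tests/utils/huggingface.py; build_info_url and summarize_info are hypothetical helpers, and the payload is canned so that no network access is needed.

```python
from urllib.parse import urlencode

# Assumed Dataset Viewer endpoint; verify against the official API documentation.
VIEWER_INFO_ENDPOINT = "https://datasets-server.huggingface.co/info"


def build_info_url(dataset):
    """Build the query URL for a dataset's info."""
    return f"{VIEWER_INFO_ENDPOINT}?{urlencode({'dataset': dataset})}"


def summarize_info(payload):
    """Map each config name to its feature names and per-split row counts."""
    summary = {}
    for config_name, info in payload.get("dataset_info", {}).items():
        splits = info.get("splits", {})
        summary[config_name] = {
            "features": sorted(info.get("features", {})),
            "rows": {split: meta.get("num_examples") for split, meta in splits.items()},
        }
    return summary


# Canned payload in the assumed shape (invented values)
canned = {
    "dataset_info": {
        "demo_config": {
            "features": {
                "question": {"dtype": "string", "_type": "Value"},
                "answer": {"dtype": "string", "_type": "Value"},
            },
            "splits": {"train": {"name": "train", "num_examples": 3, "num_bytes": 120}},
        }
    }
}
print(build_info_url("qiaojin/PubMedQA"))
print(summarize_info(canned))
```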

Troubleshooting

  • Gated dataset: Run huggingface-cli login or set HF_TOKEN
  • Rate limited: The hf_dataset wrapper in inspect_evals.utils.huggingface has built-in retry with backoff
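  • Large dataset: Pass streaming=True to load_dataset() to iterate without downloading everything first (the result is an IterableDataset, so len() and indexing are unavailable), or use split="train[:1000]" to keep only the first 1,000 rows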
  • Missing revision: Check the dataset's "Files and versions" tab on HuggingFace
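
If you load data outside that wrapper and hit rate limits, a generic retry loop with exponential backoff is one option. This is a minimal sketch of the general technique, not the wrapper's actual implementation; retry_with_backoff and flaky_load are hypothetical names, and the sleep function is injectable so the demo runs instantly.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call func(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Demo with a simulated failure; delays are recorded instead of slept
calls = {"count": 0}
delays = []


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return "loaded"


result = retry_with_backoff(flaky_load, sleep=delays.append, retry_on=(ConnectionError,))
print(result, delays)
```

In real use, func would wrap the loading call, for example lambda: load_dataset("org/dataset-name", split="train"), and retry_on would be narrowed to the errors your HTTP stack raises for rate limiting.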