Add Tests

/add-tests

Reads an arXiv paper and adds comprehensive pytest tests to its implementation, using mocks for LLM calls and external data.

Usage

code
/add-tests <arxiv_id> [options]

Arguments

  • <arxiv_id>: arXiv paper ID (e.g., 2301.00001) or implementation directory name (e.g., 2301_00001)

Options

  • --coverage: Target test coverage percentage (default: 80)
  • --mock-all: Mock all external dependencies including file I/O
  • --integration: Also generate integration tests
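
As a rough illustration of how the argument and options above fit together, the sketch below parses the text that follows `/add-tests` with `argparse`. The helper name `parse_add_tests_args`, the `impl_dir` attribute, and the dot-to-underscore normalization are assumptions made for this example (the normalization is inferred from the two example IDs above); the skill itself does not define a parser.

```python
import argparse
import shlex


def parse_add_tests_args(command_line: str) -> argparse.Namespace:
    """Parse the text that follows `/add-tests` (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="/add-tests")
    parser.add_argument("arxiv_id", help="arXiv ID or implementation directory name")
    parser.add_argument("--coverage", type=int, default=80,
                        help="target test coverage percentage")
    parser.add_argument("--mock-all", action="store_true",
                        help="mock all external dependencies, including file I/O")
    parser.add_argument("--integration", action="store_true",
                        help="also generate integration tests")
    args = parser.parse_args(shlex.split(command_line))
    # Assumption: the directory name is the arXiv ID with "." replaced by "_".
    args.impl_dir = args.arxiv_id.replace(".", "_")
    return args


args = parse_add_tests_args("2301.00001 --coverage 90 --integration")
print(args.impl_dir, args.coverage, args.mock_all, args.integration)
```

Passing a directory name such as `2301_00001` leaves it unchanged, so both forms of the argument resolve to the same implementation directory.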

Workflow

When this skill is invoked, follow these phases in order:

Phase 1: Locate Implementation and Paper

  1. Find the implementation directory

    bash
    cd /Users/shibuiyusuke/tmp/paper2code
    # List available implementations
    ls paper_impl/
    
  2. Verify the implementation exists

    bash
    # Check structure of target implementation
    ls -la paper_impl/{arxiv_id}/
    ls -la paper_impl/{arxiv_id}/src/
    
  3. Read the paper PDF using the Read tool:

    • Read the PDF at paper_impl/{arxiv_id}/ to understand:
      • The algorithm/model being implemented
      • Expected inputs/outputs
      • Edge cases mentioned in the paper
      • Numerical examples that can be used as test cases

Phase 2: Analyze Existing Code

  1. Read all source files in paper_impl/{arxiv_id}/src/:

    • Identify all classes, functions, and methods
    • Note which functions call LLMs or external APIs
    • Identify data loading and processing functions
    • Document the public interface that needs testing
  2. Identify mock requirements:

    • LLM calls: Functions that call OpenAI, Anthropic, or other LLM APIs
    • External APIs: HTTP requests, database queries
    • File I/O: Reading/writing files, especially PDFs or large datasets
    • Network requests: Any requests or httpx calls
  3. Check for existing tests:

    bash
    ls paper_impl/{arxiv_id}/tests/
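
One way to approach the "identify mock requirements" step is to scan each source file's imports for packages that usually mean an external call. The sketch below uses the standard-library `ast` module; the function name `find_external_imports` and the package list are assumptions for illustration, and an import only hints at a call site, so the flagged files still need to be read.

```python
import ast

# Assumption: top-level packages that suggest an LLM or network call site.
EXTERNAL_PACKAGES = {"openai", "anthropic", "requests", "httpx"}


def find_external_imports(source: str) -> set[str]:
    """Return the external packages imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as `from .requests import x`
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in EXTERNAL_PACKAGES:
                found.add(root)
    return found


example = "import numpy as np\nfrom openai import OpenAI\nimport httpx\n"
print(sorted(find_external_imports(example)))
```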
    

Phase 3: Create Test Infrastructure

  1. Create the test directory structure (if it does not already exist):

    code
    paper_impl/{arxiv_id}/tests/
    ├── __init__.py
    ├── conftest.py          # Shared fixtures and mocks
    ├── test_model.py        # Core model tests
    ├── test_layers.py       # Layer/module tests
    ├── test_utils.py        # Utility function tests
    └── test_integration.py  # Integration tests (if --integration)
    
  2. Create conftest.py with common fixtures:

    python
    """Shared fixtures and mocks for testing."""
    
    import pytest
    from unittest.mock import Mock, MagicMock, patch
    import numpy as np
    import torch  # or appropriate framework
    
    # === LLM Mock Fixtures ===
    
    @pytest.fixture
    def mock_llm_response():
        """Mock LLM response for testing."""
        return {
            "choices": [{
                "message": {
                    "content": "Mocked LLM response for testing purposes.",
                    "role": "assistant"
                }
            }],
            "usage": {"prompt_tokens": 10, "completion_tokens": 20}
        }
    
    @pytest.fixture
    def mock_openai_client(mock_llm_response):
        """Mock OpenAI client whose reply text comes from mock_llm_response."""
        content = mock_llm_response["choices"][0]["message"]["content"]
        mock_client = MagicMock()
        mock_client.chat.completions.create.return_value = MagicMock(
            choices=[MagicMock(message=MagicMock(content=content))]
        )
        return mock_client
    
    @pytest.fixture
    def mock_anthropic_client():
        """Mock Anthropic client."""
        mock_client = MagicMock()
        mock_client.messages.create.return_value = MagicMock(
            content=[MagicMock(text="Mocked response")]
        )
        return mock_client
    
    # === Data Mock Fixtures ===
    
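    @pytest.fixture
    def sample_embedding():
        """Sample embedding vector for testing (seeded for reproducibility)."""
        rng = np.random.default_rng(0)
        return rng.standard_normal(768).astype(np.float32)
    
    @pytest.fixture
    def sample_batch_embeddings():
        """Batch of sample embeddings (seeded for reproducibility)."""
        rng = np.random.default_rng(0)
        return rng.standard_normal((8, 768)).astype(np.float32)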
    
    @pytest.fixture
    def sample_text_data():
        """Sample text data for testing."""
        return [
            "This is a test sentence for embedding.",
            "Another example text for testing purposes.",
            "Machine learning models need training data.",
        ]
    
    @pytest.fixture
    def mock_pdf_content():
        """Mock PDF content."""
        return """
        Abstract: This paper presents a novel approach...
        1. Introduction
        The problem of X is challenging because...
        2. Method
        We propose Algorithm 1 which...
        """
    
    # === Network/API Mock Fixtures ===
    
    @pytest.fixture
    def mock_http_response():
        """Mock HTTP response."""
        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.json.return_value = {"data": "mocked"}
        mock_response.text = "Mocked response text"
        return mock_response
    
    # === Tensor Fixtures (for PyTorch-based implementations) ===
    
    @pytest.fixture
    def sample_tensor():
        """Sample tensor for testing (seeded for reproducibility)."""
        generator = torch.Generator().manual_seed(0)
        return torch.randn(2, 10, 64, generator=generator)  # (batch, seq_len, dim)
    
    @pytest.fixture
    def device():
        """Get appropriate device."""
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
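
The directory layout from step 1 can be created without overwriting tests that already exist. The following is a minimal sketch using `pathlib`; the helper name `scaffold_tests` is hypothetical, and the files are created empty so their contents can be filled in afterwards.

```python
import tempfile
from pathlib import Path

TEST_FILES = ["__init__.py", "conftest.py", "test_model.py",
              "test_layers.py", "test_utils.py"]


def scaffold_tests(impl_dir: Path, integration: bool = False) -> list[Path]:
    """Create tests/ under impl_dir and return the files that were newly created."""
    tests_dir = impl_dir / "tests"
    tests_dir.mkdir(parents=True, exist_ok=True)
    names = TEST_FILES + (["test_integration.py"] if integration else [])
    created = []
    for name in names:
        path = tests_dir / name
        if not path.exists():  # never overwrite tests that are already there
            path.touch()
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    impl_dir = Path(tmp) / "2301_00001"
    created = scaffold_tests(impl_dir, integration=True)
    print([p.name for p in created])
    print(scaffold_tests(impl_dir, integration=True))  # second run creates nothing
```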
    

Phase 4: Write Unit Tests

Write tests following these patterns:

4.1 Testing Functions with LLM Calls

python
"""Tests for modules that call LLMs."""

import pytest
from unittest.mock import patch, MagicMock

class TestLLMModule:
    """Tests for LLM-dependent functionality."""

    def test_llm_call_with_mock(self, mock_openai_client):
        """Test function that calls LLM with mocked response."""
        with patch("src.module.openai_client", mock_openai_client):
            from src.module import process_with_llm

            result = process_with_llm("test input")

            # Verify LLM was called correctly
            mock_openai_client.chat.completions.create.assert_called_once()
            assert result is not None

    def test_llm_error_handling(self):
        """Test graceful handling of LLM errors."""
        with patch("src.module.openai_client") as mock_client:
            mock_client.chat.completions.create.side_effect = Exception("API Error")

            from src.module import process_with_llm

            with pytest.raises(Exception):
                process_with_llm("test input")

    @pytest.fixture
    def mock_structured_response(self):
        """Mock for structured LLM output."""
        return {
            "analysis": "mocked analysis",
            "confidence": 0.95,
            "recommendations": ["rec1", "rec2"]
        }

    def test_structured_llm_output(self, mock_structured_response):
        """Test parsing of structured LLM output."""
        with patch("src.module.call_llm") as mock_call:
            import json
            mock_call.return_value = json.dumps(mock_structured_response)

            from src.module import get_structured_analysis

            result = get_structured_analysis("input")
            assert "analysis" in result
            assert result["confidence"] == 0.95

4.2 Testing Data Processing with Mocked Data

python
"""Tests for data processing functions."""

import pytest
from unittest.mock import Mock, patch, mock_open
import numpy as np

class TestDataProcessing:
    """Tests for data loading and processing."""

    def test_load_embeddings(self, sample_batch_embeddings):
        """Test embedding loading with mocked data."""
        with patch("numpy.load") as mock_load:
            mock_load.return_value = sample_batch_embeddings

            from src.data_loader import load_embeddings

            result = load_embeddings("fake_path.npy")
            assert result.shape == (8, 768)

    def test_process_pdf(self, mock_pdf_content):
        """Test PDF processing with mocked content."""
        with patch("src.pdf_parser.extract_text") as mock_extract:
            mock_extract.return_value = mock_pdf_content

            from src.pdf_parser import process_paper

            result = process_paper("fake_paper.pdf")
            assert "Abstract" in result

    def test_file_not_found(self):
        """Test handling of missing files."""
        with patch("builtins.open", side_effect=FileNotFoundError):
            from src.data_loader import load_config

            with pytest.raises(FileNotFoundError):
                load_config("nonexistent.yaml")

4.3 Testing Model Components

python
"""Tests for model/algorithm components."""

import pytest
import torch
import numpy as np

class TestModel:
    """Tests for core model functionality."""

    @pytest.fixture
    def model_config(self):
        """Create test configuration."""
        from src.config import Config
        return Config(
            hidden_dim=64,
            num_layers=2,
            dropout=0.0,  # Disable dropout for deterministic tests
        )

    @pytest.fixture
    def model(self, model_config):
        """Create model instance."""
        from src.model import Model
        return Model(model_config)

    def test_forward_pass_shape(self, model, sample_tensor):
        """Test output shape matches expected."""
        output = model(sample_tensor)
        assert output.shape == sample_tensor.shape

    def test_gradient_flow(self, model, sample_tensor):
        """Test gradients flow through model."""
        sample_tensor.requires_grad_(True)
        output = model(sample_tensor)
        loss = output.sum()
        loss.backward()

        assert sample_tensor.grad is not None
        assert not torch.isnan(sample_tensor.grad).any()

    def test_deterministic_output(self, model, sample_tensor):
        """Test model produces consistent output."""
        torch.manual_seed(42)
        output1 = model(sample_tensor)

        torch.manual_seed(42)
        output2 = model(sample_tensor)

        assert torch.allclose(output1, output2)

    def test_batch_independence(self, model):
        """Test batch elements are processed independently."""
        x = torch.randn(4, 10, 64)

        # Process full batch
        full_output = model(x)

        # Process individual samples and re-assemble them along the batch axis
        individual_outputs = torch.cat([model(x[i:i+1]) for i in range(4)], dim=0)

        assert torch.allclose(full_output, individual_outputs, atol=1e-5)

    def test_numerical_stability(self, model):
        """Test model handles edge cases."""
        # Very small values
        small_input = torch.randn(2, 10, 64) * 1e-6
        output_small = model(small_input)
        assert not torch.isnan(output_small).any()
        assert not torch.isinf(output_small).any()

        # Very large values
        large_input = torch.randn(2, 10, 64) * 1e3
        output_large = model(large_input)
        assert not torch.isnan(output_large).any()
        assert not torch.isinf(output_large).any()

4.4 Testing with Paper-Specific Examples

python
"""Tests using examples from the paper."""

import pytest
import numpy as np

class TestPaperExamples:
    """Tests derived from examples in the paper."""

    def test_equation_3_implementation(self):
        """
        Test implementation of Eq. (3) from Section 3.2.

        According to the paper:
        y = softmax(QK^T / sqrt(d_k)) @ V

        With Q = K = V = I (identity), output should equal softmax(I/sqrt(d)) @ I
        """
        from src.model import attention

        d_k = 64
        identity = np.eye(d_k, dtype=np.float32)

        result = attention(
            query=identity,
            key=identity,
            value=identity,
        )

        # Expected: softmax of identity scaled by sqrt(d_k)
        expected_attn = np.exp(identity / np.sqrt(d_k))
        expected_attn = expected_attn / expected_attn.sum(axis=-1, keepdims=True)
        expected = expected_attn @ identity

        np.testing.assert_allclose(result, expected, rtol=1e-5)

    def test_algorithm_1_step_by_step(self):
        """
        Test Algorithm 1 as described in Section 4.

        This verifies each step of the algorithm matches paper description.
        """
        from src.algorithm import Algorithm1

        algo = Algorithm1()

        # Step 1: Initialize (paper says initialize to zeros)
        state = algo.initialize(dim=10)
        assert np.allclose(state, np.zeros(10))

        # Step 2: Update rule (paper Eq. 5)
        input_data = np.ones(10)
        updated = algo.update(state, input_data)
        # According to paper: new_state = 0.9 * state + 0.1 * input
        expected = 0.9 * state + 0.1 * input_data
        np.testing.assert_allclose(updated, expected)
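
For the Eq. (3) test above, the expected matrix can also be written in closed form, which gives a quick sanity check on `expected`. With d_k = 64 and Q = K = V = I, every row of QK^T / sqrt(d_k) holds one entry equal to 1/8 and 63 zeros, so the softmax takes only two distinct values, and multiplying by V = I leaves the matrix unchanged:

```latex
\[
\operatorname{softmax}\!\left(\frac{I}{\sqrt{64}}\right)_{ij} =
\begin{cases}
\dfrac{e^{1/8}}{e^{1/8} + 63} \approx 0.01767, & i = j,\\[2ex]
\dfrac{1}{e^{1/8} + 63} \approx 0.01559, & i \neq j.
\end{cases}
\]
```

The diagonal is only slightly larger than the off-diagonal entries, so a test that merely checks the output is "close to uniform" would not catch a missing scaling factor; comparing against the exact values does.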

Phase 5: Create Parametrized Tests

python
"""Parametrized tests for comprehensive coverage."""

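import pytest
import torch

# NOTE: the `model` fixture in 4.3 is defined inside TestModel, so it is not
# visible here. Move it (and `model_config`) to conftest.py before using it below.

class TestParametrized:
    """Parametrized tests for various input scenarios."""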

    @pytest.mark.parametrize("batch_size", [1, 4, 16, 32])
    def test_various_batch_sizes(self, model, batch_size):
        """Test model with different batch sizes."""
        x = torch.randn(batch_size, 10, 64)
        output = model(x)
        assert output.shape[0] == batch_size

    @pytest.mark.parametrize("seq_len", [1, 10, 100, 512])
    def test_various_sequence_lengths(self, model, seq_len):
        """Test model with different sequence lengths."""
        x = torch.randn(2, seq_len, 64)
        output = model(x)
        assert output.shape[1] == seq_len

    @pytest.mark.parametrize("input_type,expected_error", [
        (None, TypeError),
        ("string", TypeError),
        ([], ValueError),
    ])
    def test_invalid_inputs(self, model, input_type, expected_error):
        """Test model rejects invalid inputs."""
        with pytest.raises(expected_error):
            model(input_type)

Phase 6: Add Integration Tests (if --integration)

python
"""Integration tests for end-to-end workflows."""

import pytest
from unittest.mock import patch, MagicMock

class TestIntegration:
    """End-to-end integration tests."""

    @pytest.fixture
    def mock_external_services(self, mock_openai_client, mock_http_response):
        """Mock all external services."""
        with patch("src.llm.client", mock_openai_client), \
             patch("requests.get", return_value=mock_http_response), \
             patch("requests.post", return_value=mock_http_response):
            yield

    def test_full_pipeline(self, mock_external_services, sample_text_data):
        """Test complete processing pipeline."""
        from src.pipeline import Pipeline

        pipeline = Pipeline()

        # Run full pipeline with mocked externals
        result = pipeline.process(sample_text_data)

        assert result is not None
        assert "output" in result

    def test_error_recovery(self, mock_openai_client):
        """Test pipeline recovers from transient errors."""
        # First call fails, second succeeds
        mock_openai_client.chat.completions.create.side_effect = [
            Exception("Transient error"),
            MagicMock(choices=[MagicMock(message=MagicMock(content="Success"))])
        ]

        with patch("src.llm.client", mock_openai_client):
            from src.pipeline import Pipeline

            pipeline = Pipeline(retry_count=2)
            result = pipeline.process_with_retry("input")

            assert result == "Success"
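
The error-recovery test above relies on a list-valued `side_effect`: each call consumes the next item, raising it if it is an exception and returning it otherwise. The sketch below shows that behavior against a small retry helper. `call_with_retry` is a hypothetical stand-in for whatever `Pipeline.process_with_retry` does internally, which this document does not show.

```python
from unittest.mock import MagicMock


def call_with_retry(fn, retry_count: int, *args, **kwargs):
    """Call fn, making at most retry_count attempts before re-raising."""
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    last_error = None
    for _ in range(retry_count):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
    raise last_error


# First call raises, second call returns "Success".
flaky = MagicMock(side_effect=[Exception("Transient error"), "Success"])

result = call_with_retry(flaky, 2, "input")
print(result, flaky.call_count)
```

Asserting on `call_count` as well as the result confirms that the retry actually happened, rather than the first call succeeding by accident.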

Phase 7: Run and Verify Tests

  1. Run all tests:

    bash
    cd /Users/shibuiyusuke/tmp/paper2code
    uv run pytest paper_impl/{arxiv_id}/tests/ -v
    
  2. Check coverage:

    bash
    uv run pytest paper_impl/{arxiv_id}/tests/ --cov=paper_impl/{arxiv_id}/src --cov-report=term-missing
    
  3. Fix any failing tests and ensure coverage meets target
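
To make the coverage target enforceable rather than something checked by eye, the coverage command can be extended with pytest-cov's `--cov-fail-under` option, which makes the run fail when total coverage is below the threshold. The sketch below only builds the argument list; the helper name `build_pytest_command` is an assumption, and the paths mirror the commands above.

```python
def build_pytest_command(arxiv_id: str, coverage: int = 80) -> list[str]:
    """Build a pytest invocation that fails when coverage is below the target."""
    impl = f"paper_impl/{arxiv_id}"
    return [
        "uv", "run", "pytest", f"{impl}/tests/",
        f"--cov={impl}/src",
        "--cov-report=term-missing",
        f"--cov-fail-under={coverage}",  # pytest-cov threshold option
    ]


print(" ".join(build_pytest_command("2301_00001", coverage=85)))
```

The resulting list can be passed to `subprocess.run`, and a non-zero return code then signals either a failing test or coverage below the target.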

Mock Patterns Reference

Mocking LLM Clients

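python
from unittest.mock import MagicMock, patch

# Patch the name where the code under test looks it up: "openai.OpenAI" works
# when that code calls openai.OpenAI(...). If it uses `from openai import OpenAI`,
# patch "<module_under_test>.OpenAI" instead.

# OpenAI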
@patch("openai.OpenAI")
def test_openai(mock_openai):
    mock_client = MagicMock()
    mock_openai.return_value = mock_client
    mock_client.chat.completions.create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="response"))]
    )

# Anthropic
@patch("anthropic.Anthropic")
def test_anthropic(mock_anthropic):
    mock_client = MagicMock()
    mock_anthropic.return_value = mock_client
    mock_client.messages.create.return_value = MagicMock(
        content=[MagicMock(text="response")]
    )

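# LangChain (the import path depends on the installed version; recent releases
# provide the OpenAI wrapper in the separate langchain_openai package)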
@patch("langchain.llms.OpenAI")
def test_langchain(mock_llm):
    mock_llm.return_value.return_value = "mocked response"
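
A common reason these patterns appear to "do nothing" is that the patch targets the module that defines the client rather than the module that uses it. The sketch below demonstrates the difference with two throwaway in-memory modules; `fake_sdk` and `fake_app` are hypothetical names created only for this example, standing in for an SDK and for the code under test.

```python
import sys
import types
from unittest.mock import patch


class Client:
    """Stand-in for a real SDK client; calling it means a real API call."""

    def complete(self, prompt):
        raise RuntimeError("real API call")


# Hypothetical SDK module.
sdk = types.ModuleType("fake_sdk")
sdk.Client = Client
sys.modules["fake_sdk"] = sdk

# Hypothetical module under test, which binds Client at import time.
app = types.ModuleType("fake_app")
exec(
    "from fake_sdk import Client\n"
    "def summarize(text):\n"
    "    return Client().complete(text)\n",
    app.__dict__,
)
sys.modules["fake_app"] = app

# Patching the defining module does not touch the name fake_app already holds.
with patch("fake_sdk.Client"):
    try:
        app.summarize("hello")
        reached_real_client = False
    except RuntimeError:
        reached_real_client = True

# Patching the name inside the module under test takes effect.
with patch("fake_app.Client") as mock_cls:
    mock_cls.return_value.complete.return_value = "mocked"
    patched_result = app.summarize("hello")

print(reached_real_client, patched_result)
```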

Mocking Data Sources

python
from unittest.mock import mock_open, patch

import numpy as np
import pandas as pd

# File reading
@patch("builtins.open", mock_open(read_data="mocked file content"))
def test_file_read():
    ...

# NumPy load
@patch("numpy.load")
def test_numpy_load(mock_load):
    mock_load.return_value = np.array([[1, 2], [3, 4]])

# Pandas read_csv
@patch("pandas.read_csv")
def test_pandas(mock_read):
    mock_read.return_value = pd.DataFrame({"col": [1, 2, 3]})

# HTTP requests
@patch("requests.get")
def test_http(mock_get):
    mock_get.return_value.json.return_value = {"key": "value"}
    mock_get.return_value.status_code = 200

Mocking Embedding Models

python
from unittest.mock import MagicMock, patch

import numpy as np

# Sentence Transformers
@patch("sentence_transformers.SentenceTransformer")
def test_embeddings(mock_st):
    mock_model = MagicMock()
    mock_st.return_value = mock_model
    mock_model.encode.return_value = np.random.randn(3, 384)

# OpenAI Embeddings
@patch("openai.OpenAI")
def test_openai_embeddings(mock_openai):
    mock_client = MagicMock()
    mock_openai.return_value = mock_client
    mock_client.embeddings.create.return_value = MagicMock(
        data=[MagicMock(embedding=[0.1] * 1536)]
    )

Output

After running this skill, the implementation will have:

  1. tests/conftest.py: Shared fixtures for mocking LLMs, data, and external services
  2. tests/test_*.py: Comprehensive test files for each module
  3. Coverage report: Showing test coverage percentage

Important Guidelines

  • Always mock external dependencies: Never make real API calls in tests
  • Use fixtures for reusable mocks: Define common mocks in conftest.py
  • Test edge cases: Include tests for empty inputs, large inputs, and error conditions
  • Reference the paper: Add docstrings explaining which paper section/equation is being tested
  • Keep tests fast: Use small tensors and minimal iterations
  • Test deterministically: Set random seeds where needed
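
The first guideline can be backed by a coarse safety net that turns an accidental real connection into an immediate failure. The sketch below patches `socket.socket.connect` from the standard library; the name `block_network` is an assumption, and the guard only covers code paths that go through `connect` (it is not a full sandbox).

```python
import socket
from contextlib import contextmanager
from unittest.mock import patch


@contextmanager
def block_network():
    """Make any socket.connect call raise while the block is active."""
    error = RuntimeError("Network access is disabled in tests; mock the client instead.")
    with patch.object(socket.socket, "connect", side_effect=error):
        yield


with block_network():
    sock = socket.socket()
    try:
        sock.connect(("127.0.0.1", 9))
        blocked = False
    except RuntimeError:
        blocked = True
    finally:
        sock.close()

print(blocked)
```

Wrapping the context manager in an autouse fixture in conftest.py would apply it to every test, so an unmocked client fails loudly instead of reaching a real API.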

Example Session

code
User: /add-tests 2506_08098v1

Claude: I'll add comprehensive pytest tests to the Cognitive Weave implementation.

[Phase 1: Locating implementation...]
Found: paper_impl/2506_08098v1/
Reading paper PDF to understand algorithm...

[Phase 2: Analyzing code...]
Found modules: data_structures.py, vectorial_resonator.py, strg.py, nexus_weaver.py
Identified LLM calls in: semantic_oracle.py
Identified embedding calls in: vectorial_resonator.py

[Phase 3: Creating test infrastructure...]
Created: tests/conftest.py with fixtures for mocking

[Phase 4: Writing unit tests...]
Created tests for all modules with mocked dependencies

[Phase 7: Running tests...]
All 47 tests passed
Coverage: 85%

Tests added successfully to paper_impl/2506_08098v1/tests/