/add-tests

Reads an arXiv paper and adds comprehensive pytest tests to its implementation, using mocks for LLM calls and external data.

Usage

code

/add-tests <arxiv_id> [options]

Arguments

•<arxiv_id>: arXiv paper ID (e.g., 2301.00001) or implementation directory name (e.g., 2301_00001)

Options

•--coverage: Target test coverage percentage (default: 80)
•--mock-all: Mock all external dependencies including file I/O
•--integration: Also generate integration tests

Workflow

When this skill is invoked, follow these phases in order:

Phase 1: Locate Implementation and Paper

•

Find the implementation directory

bash

cd /Users/shibuiyusuke/tmp/paper2code
# List available implementations
ls paper_impl/

•

Verify the implementation exists

bash

# Check structure of target implementation
ls -la paper_impl/{arxiv_id}/
ls -la paper_impl/{arxiv_id}/src/

•
Read the paper PDF using the Read tool:
- •
  Read the PDF at paper_impl/{arxiv_id}/ to understand:
  - •The algorithm/model being implemented
  - •Expected inputs/outputs
  - •Edge cases mentioned in the paper
  - •Numerical examples that can be used as test cases

Phase 2: Analyze Existing Code

•
Read all source files in paper_impl/{arxiv_id}/src/:
- •Identify all classes, functions, and methods
- •Note which functions call LLMs or external APIs
- •Identify data loading and processing functions
- •Document the public interface that needs testing
•
Identify mock requirements:
- •LLM calls: Functions that call OpenAI, Anthropic, or other LLM APIs
- •External APIs: HTTP requests, database queries
- •File I/O: Reading/writing files, especially PDFs or large datasets
- •Network requests: Any requests or httpx calls
•
Check for existing tests:
bash
```
ls paper_impl/{arxiv_id}/tests/
```

Phase 3: Create Test Infrastructure

•

Create test directory structure (if not exists):

code

paper_impl/{arxiv_id}/tests/
├── __init__.py
├── conftest.py          # Shared fixtures and mocks
├── test_model.py        # Core model tests
├── test_layers.py       # Layer/module tests
├── test_utils.py        # Utility function tests
└── test_integration.py  # Integration tests (if --integration)

•

Create conftest.py with common fixtures:

python

"""Shared fixtures and mocks for testing."""

import pytest
from unittest.mock import Mock, MagicMock, patch
import numpy as np
import torch  # or appropriate framework

# === LLM Mock Fixtures ===

@pytest.fixture
def mock_llm_response():
    """Mock LLM response for testing."""
    return {
        "choices": [{
            "message": {
                "content": "Mocked LLM response for testing purposes.",
                "role": "assistant"
            }
        }],
        "usage": {"prompt_tokens": 10, "completion_tokens": 20}
    }

@pytest.fixture
def mock_openai_client(mock_llm_response):
    """Mock OpenAI client."""
    mock_client = MagicMock()
    mock_client.chat.completions.create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="Mocked response"))]
    )
    return mock_client

@pytest.fixture
def mock_anthropic_client():
    """Mock Anthropic client."""
    mock_client = MagicMock()
    mock_client.messages.create.return_value = MagicMock(
        content=[MagicMock(text="Mocked response")]
    )
    return mock_client

# === Data Mock Fixtures ===

@pytest.fixture
def sample_embedding():
    """Sample embedding vector for testing."""
    return np.random.randn(768).astype(np.float32)

@pytest.fixture
def sample_batch_embeddings():
    """Batch of sample embeddings."""
    return np.random.randn(8, 768).astype(np.float32)

@pytest.fixture
def sample_text_data():
    """Sample text data for testing."""
    return [
        "This is a test sentence for embedding.",
        "Another example text for testing purposes.",
        "Machine learning models need training data.",
    ]

@pytest.fixture
def mock_pdf_content():
    """Mock PDF content."""
    return """
    Abstract: This paper presents a novel approach...
    1. Introduction
    The problem of X is challenging because...
    2. Method
    We propose Algorithm 1 which...
    """

# === Network/API Mock Fixtures ===

@pytest.fixture
def mock_http_response():
    """Mock HTTP response."""
    mock_response = MagicMock()
    mock_response.status_code = 200
    mock_response.json.return_value = {"data": "mocked"}
    mock_response.text = "Mocked response text"
    return mock_response

# === Tensor Fixtures (for PyTorch-based implementations) ===

@pytest.fixture
def sample_tensor():
    """Sample tensor for testing."""
    return torch.randn(2, 10, 64)  # (batch, seq_len, dim)

@pytest.fixture
def device():
    """Get appropriate device."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

Phase 4: Write Unit Tests

Write tests following these patterns:

4.1 Testing Functions with LLM Calls

python

"""Tests for modules that call LLMs."""

import pytest
from unittest.mock import patch, MagicMock

class TestLLMModule:
    """Tests for LLM-dependent functionality."""

    def test_llm_call_with_mock(self, mock_openai_client):
        """Test function that calls LLM with mocked response."""
        with patch("src.module.openai_client", mock_openai_client):
            from src.module import process_with_llm

            result = process_with_llm("test input")

            # Verify LLM was called correctly
            mock_openai_client.chat.completions.create.assert_called_once()
            assert result is not None

    def test_llm_error_handling(self):
        """Test graceful handling of LLM errors."""
        with patch("src.module.openai_client") as mock_client:
            mock_client.chat.completions.create.side_effect = Exception("API Error")

            from src.module import process_with_llm

            with pytest.raises(Exception):
                process_with_llm("test input")

    @pytest.fixture
    def mock_structured_response(self):
        """Mock for structured LLM output."""
        return {
            "analysis": "mocked analysis",
            "confidence": 0.95,
            "recommendations": ["rec1", "rec2"]
        }

    def test_structured_llm_output(self, mock_structured_response):
        """Test parsing of structured LLM output."""
        with patch("src.module.call_llm") as mock_call:
            import json
            mock_call.return_value = json.dumps(mock_structured_response)

            from src.module import get_structured_analysis

            result = get_structured_analysis("input")
            assert "analysis" in result
            assert result["confidence"] == 0.95

4.2 Testing Data Processing with Mocked Data

python

"""Tests for data processing functions."""

import pytest
from unittest.mock import Mock, patch, mock_open
import numpy as np

class TestDataProcessing:
    """Tests for data loading and processing."""

    def test_load_embeddings(self, sample_batch_embeddings):
        """Test embedding loading with mocked data."""
        with patch("numpy.load") as mock_load:
            mock_load.return_value = sample_batch_embeddings

            from src.data_loader import load_embeddings

            result = load_embeddings("fake_path.npy")
            assert result.shape == (8, 768)

    def test_process_pdf(self, mock_pdf_content):
        """Test PDF processing with mocked content."""
        with patch("src.pdf_parser.extract_text") as mock_extract:
            mock_extract.return_value = mock_pdf_content

            from src.pdf_parser import process_paper

            result = process_paper("fake_paper.pdf")
            assert "Abstract" in result

    def test_file_not_found(self):
        """Test handling of missing files."""
        with patch("builtins.open", side_effect=FileNotFoundError):
            from src.data_loader import load_config

            with pytest.raises(FileNotFoundError):
                load_config("nonexistent.yaml")

4.3 Testing Model Components

python

"""Tests for model/algorithm components."""

import pytest
import torch
import numpy as np

class TestModel:
    """Tests for core model functionality."""

    @pytest.fixture
    def model_config(self):
        """Create test configuration."""
        from src.config import Config
        return Config(
            hidden_dim=64,
            num_layers=2,
            dropout=0.0,  # Disable dropout for deterministic tests
        )

    @pytest.fixture
    def model(self, model_config):
        """Create model instance."""
        from src.model import Model
        return Model(model_config)

    def test_forward_pass_shape(self, model, sample_tensor):
        """Test output shape matches expected."""
        output = model(sample_tensor)
        assert output.shape == sample_tensor.shape

    def test_gradient_flow(self, model, sample_tensor):
        """Test gradients flow through model."""
        sample_tensor.requires_grad_(True)
        output = model(sample_tensor)
        loss = output.sum()
        loss.backward()

        assert sample_tensor.grad is not None
        assert not torch.isnan(sample_tensor.grad).any()

    def test_deterministic_output(self, model, sample_tensor):
        """Test model produces consistent output."""
        torch.manual_seed(42)
        output1 = model(sample_tensor)

        torch.manual_seed(42)
        output2 = model(sample_tensor)

        assert torch.allclose(output1, output2)

    def test_batch_independence(self, model):
        """Test batch elements are processed independently."""
        x = torch.randn(4, 10, 64)

        # Process full batch
        full_output = model(x)

        # Process individual samples
        individual_outputs = torch.stack([model(x[i:i+1]) for i in range(4)])

        assert torch.allclose(full_output, individual_outputs.squeeze(1), atol=1e-5)

    def test_numerical_stability(self, model):
        """Test model handles edge cases."""
        # Very small values
        small_input = torch.randn(2, 10, 64) * 1e-6
        output_small = model(small_input)
        assert not torch.isnan(output_small).any()
        assert not torch.isinf(output_small).any()

        # Very large values
        large_input = torch.randn(2, 10, 64) * 1e3
        output_large = model(large_input)
        assert not torch.isnan(output_large).any()

4.4 Testing with Paper-Specific Examples

python

"""Tests using examples from the paper."""

import pytest
import numpy as np

class TestPaperExamples:
    """Tests derived from examples in the paper."""

    def test_equation_3_implementation(self):
        """
        Test implementation of Eq. (3) from Section 3.2.

        According to the paper:
        y = softmax(QK^T / sqrt(d_k)) @ V

        With Q = K = V = I (identity), output should equal softmax(I/sqrt(d)) @ I
        """
        from src.model import attention

        d_k = 64
        identity = np.eye(d_k, dtype=np.float32)

        result = attention(
            query=identity,
            key=identity,
            value=identity,
        )

        # Expected: softmax of identity scaled by sqrt(d_k)
        expected_attn = np.exp(identity / np.sqrt(d_k))
        expected_attn = expected_attn / expected_attn.sum(axis=-1, keepdims=True)
        expected = expected_attn @ identity

        np.testing.assert_allclose(result, expected, rtol=1e-5)

    def test_algorithm_1_step_by_step(self):
        """
        Test Algorithm 1 as described in Section 4.

        This verifies each step of the algorithm matches paper description.
        """
        from src.algorithm import Algorithm1

        algo = Algorithm1()

        # Step 1: Initialize (paper says initialize to zeros)
        state = algo.initialize(dim=10)
        assert np.allclose(state, np.zeros(10))

        # Step 2: Update rule (paper Eq. 5)
        input_data = np.ones(10)
        updated = algo.update(state, input_data)
        # According to paper: new_state = 0.9 * state + 0.1 * input
        expected = 0.9 * state + 0.1 * input_data
        np.testing.assert_allclose(updated, expected)

Phase 5: Create Parametrized Tests

python

"""Parametrized tests for comprehensive coverage."""

import pytest
import numpy as np

class TestParametrized:
    """Parametrized tests for various input scenarios."""

    @pytest.mark.parametrize("batch_size", [1, 4, 16, 32])
    def test_various_batch_sizes(self, model, batch_size):
        """Test model with different batch sizes."""
        x = torch.randn(batch_size, 10, 64)
        output = model(x)
        assert output.shape[0] == batch_size

    @pytest.mark.parametrize("seq_len", [1, 10, 100, 512])
    def test_various_sequence_lengths(self, model, seq_len):
        """Test model with different sequence lengths."""
        x = torch.randn(2, seq_len, 64)
        output = model(x)
        assert output.shape[1] == seq_len

    @pytest.mark.parametrize("input_type,expected_error", [
        (None, TypeError),
        ("string", TypeError),
        ([], ValueError),
    ])
    def test_invalid_inputs(self, model, input_type, expected_error):
        """Test model rejects invalid inputs."""
        with pytest.raises(expected_error):
            model(input_type)

Phase 6: Add Integration Tests (if --integration)

python

"""Integration tests for end-to-end workflows."""

import pytest
from unittest.mock import patch, MagicMock

class TestIntegration:
    """End-to-end integration tests."""

    @pytest.fixture
    def mock_external_services(self, mock_openai_client, mock_http_response):
        """Mock all external services."""
        with patch("src.llm.client", mock_openai_client), \
             patch("requests.get", return_value=mock_http_response), \
             patch("requests.post", return_value=mock_http_response):
            yield

    def test_full_pipeline(self, mock_external_services, sample_text_data):
        """Test complete processing pipeline."""
        from src.pipeline import Pipeline

        pipeline = Pipeline()

        # Run full pipeline with mocked externals
        result = pipeline.process(sample_text_data)

        assert result is not None
        assert "output" in result

    def test_error_recovery(self, mock_openai_client):
        """Test pipeline recovers from transient errors."""
        # First call fails, second succeeds
        mock_openai_client.chat.completions.create.side_effect = [
            Exception("Transient error"),
            MagicMock(choices=[MagicMock(message=MagicMock(content="Success"))])
        ]

        with patch("src.llm.client", mock_openai_client):
            from src.pipeline import Pipeline

            pipeline = Pipeline(retry_count=2)
            result = pipeline.process_with_retry("input")

            assert result == "Success"

Phase 7: Run and Verify Tests

•

Run all tests:

bash

cd /Users/shibuiyusuke/tmp/paper2code
uv run pytest paper_impl/{arxiv_id}/tests/ -v

•

Check coverage:

bash

uv run pytest paper_impl/{arxiv_id}/tests/ --cov=paper_impl/{arxiv_id}/src --cov-report=term-missing

•
Fix any failing tests and ensure coverage meets target

Mock Patterns Reference

Mocking LLM Clients

python

# OpenAI
@patch("openai.OpenAI")
def test_openai(mock_openai):
    mock_client = MagicMock()
    mock_openai.return_value = mock_client
    mock_client.chat.completions.create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="response"))]
    )

# Anthropic
@patch("anthropic.Anthropic")
def test_anthropic(mock_anthropic):
    mock_client = MagicMock()
    mock_anthropic.return_value = mock_client
    mock_client.messages.create.return_value = MagicMock(
        content=[MagicMock(text="response")]
    )

# LangChain
@patch("langchain.llms.OpenAI")
def test_langchain(mock_llm):
    mock_llm.return_value.return_value = "mocked response"

Mocking Data Sources

python

# File reading
@patch("builtins.open", mock_open(read_data="mocked file content"))
def test_file_read():
    ...

# NumPy load
@patch("numpy.load")
def test_numpy_load(mock_load):
    mock_load.return_value = np.array([[1, 2], [3, 4]])

# Pandas read_csv
@patch("pandas.read_csv")
def test_pandas(mock_read):
    mock_read.return_value = pd.DataFrame({"col": [1, 2, 3]})

# HTTP requests
@patch("requests.get")
def test_http(mock_get):
    mock_get.return_value.json.return_value = {"key": "value"}
    mock_get.return_value.status_code = 200

Mocking Embedding Models

python

# Sentence Transformers
@patch("sentence_transformers.SentenceTransformer")
def test_embeddings(mock_st):
    mock_model = MagicMock()
    mock_st.return_value = mock_model
    mock_model.encode.return_value = np.random.randn(3, 384)

# OpenAI Embeddings
@patch("openai.OpenAI")
def test_openai_embeddings(mock_openai):
    mock_client = MagicMock()
    mock_openai.return_value = mock_client
    mock_client.embeddings.create.return_value = MagicMock(
        data=[MagicMock(embedding=[0.1] * 1536)]
    )

Output

After running this skill, the implementation will have:

•tests/conftest.py: Shared fixtures for mocking LLMs, data, and external services
•tests/test_*.py: Comprehensive test files for each module
•Coverage report: Showing test coverage percentage

Important Guidelines

•Always mock external dependencies: Never make real API calls in tests
•Use fixtures for reusable mocks: Define common mocks in conftest.py
•Test edge cases: Include tests for empty inputs, large inputs, and error conditions
•Reference the paper: Add docstrings explaining which paper section/equation is being tested
•Keep tests fast: Use small tensors and minimal iterations
•Test deterministically: Set random seeds where needed

Example Session

code

User: /add-tests 2506_08098v1

Claude: I'll add comprehensive pytest tests to the Cognitive Weave implementation.

[Phase 1: Locating implementation...]
Found: paper_impl/2506_08098v1/
Reading paper PDF to understand algorithm...

[Phase 2: Analyzing code...]
Found modules: data_structures.py, vectorial_resonator.py, strg.py, nexus_weaver.py
Identified LLM calls in: semantic_oracle.py
Identified embedding calls in: vectorial_resonator.py

[Phase 3: Creating test infrastructure...]
Created: tests/conftest.py with fixtures for mocking

[Phase 4: Writing unit tests...]
Created tests for all modules with mocked dependencies

[Phase 5: Running tests...]
All 47 tests passed
Coverage: 85%

Tests added successfully to paper_impl/2506_08098v1/tests/