/add-tests
Reads an arXiv paper and adds comprehensive pytest tests to its implementation, using mocks for LLM calls and external data.
Usage
code
/add-tests <arxiv_id> [options]
Arguments
- <arxiv_id>: arXiv paper ID (e.g., 2301.00001) or implementation directory name (e.g., 2301_00001)
Options
- --coverage: Target test coverage percentage (default: 80)
- --mock-all: Mock all external dependencies including file I/O
- --integration: Also generate integration tests
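The two accepted forms of `<arxiv_id>` can be normalized to a single directory name before anything else runs. The sketch below is a minimal illustration: the helper name `to_impl_dirname` and the accepted ID pattern are assumptions, and the dot-to-underscore mapping is inferred only from the examples above.

```python
import re

# Hypothetical helper: the dot-to-underscore mapping is inferred from the
# examples above (2301.00001 -> 2301_00001), not from a documented rule.
_ARXIV_ID = re.compile(r"^\d{4}[._]\d{4,5}(v\d+)?$")


def to_impl_dirname(arxiv_id: str) -> str:
    """Normalize an arXiv ID or directory name to the directory form."""
    arxiv_id = arxiv_id.strip()
    if not _ARXIV_ID.match(arxiv_id):
        raise ValueError(f"unrecognized arXiv ID: {arxiv_id!r}")
    return arxiv_id.replace(".", "_")
```

Passing a name that is already in directory form, such as `2506_08098v1`, returns it unchanged.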
Workflow
When this skill is invoked, follow these phases in order:
Phase 1: Locate Implementation and Paper
- Find the implementation directory
bash
cd /Users/shibuiyusuke/tmp/paper2code
# List available implementations
ls paper_impl/
- Verify the implementation exists
bash
# Check structure of target implementation
ls -la paper_impl/{arxiv_id}/
ls -la paper_impl/{arxiv_id}/src/
- Read the paper PDF using the Read tool:
  - Read the PDF at paper_impl/{arxiv_id}/ to understand:
    - The algorithm/model being implemented
    - Expected inputs/outputs
    - Edge cases mentioned in the paper
    - Numerical examples that can be used as test cases
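Locating the paper can be scripted as well. This is a rough sketch under two assumptions that the text above does not state: the PDF sits directly inside the implementation directory, and it has a `.pdf` extension. The function name `find_paper_pdf` is hypothetical.

```python
from pathlib import Path


def find_paper_pdf(impl_dir: Path) -> Path:
    """Return the first PDF (sorted by name) found directly in impl_dir."""
    if not impl_dir.is_dir():
        raise FileNotFoundError(f"no implementation directory: {impl_dir}")
    # Assumption: the paper is stored as *.pdf at the top level of the directory.
    pdfs = sorted(impl_dir.glob("*.pdf"))
    if not pdfs:
        raise FileNotFoundError(f"no PDF found in {impl_dir}")
    return pdfs[0]
```

A call such as `find_paper_pdf(Path("paper_impl") / "2301_00001")` fails early with a clear message when either the directory or the PDF is missing.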
Phase 2: Analyze Existing Code
- Read all source files in paper_impl/{arxiv_id}/src/:
  - Identify all classes, functions, and methods
  - Note which functions call LLMs or external APIs
  - Identify data loading and processing functions
  - Document the public interface that needs testing
- Identify mock requirements:
  - LLM calls: Functions that call OpenAI, Anthropic, or other LLM APIs
  - External APIs: HTTP requests, database queries
  - File I/O: Reading/writing files, especially PDFs or large datasets
  - Network requests: Any requests or httpx calls
- Check for existing tests:
bash
ls paper_impl/{arxiv_id}/tests/
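Spotting mock requirements can be partly automated by scanning imports. The following sketch walks a module's syntax tree and reports which external packages it imports. The package list and the function name `external_imports` are assumptions chosen for illustration; a real project may depend on other clients.

```python
import ast

# Assumed list of top-level packages that signal an external dependency.
EXTERNAL_PACKAGES = {"openai", "anthropic", "requests", "httpx"}


def external_imports(source: str) -> set[str]:
    """Return the external packages imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from .requests import x"
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in EXTERNAL_PACKAGES:
                found.add(top_level)
    return found
```

Running it over each file in `src/` gives a first list of modules whose functions need mocks; imports alone do not prove a call is made, so the files still need to be read.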
Phase 3: Create Test Infrastructure
- Create test directory structure (if not exists):
code
paper_impl/{arxiv_id}/tests/
├── __init__.py
├── conftest.py          # Shared fixtures and mocks
├── test_model.py        # Core model tests
├── test_layers.py       # Layer/module tests
├── test_utils.py        # Utility function tests
└── test_integration.py  # Integration tests (if --integration)
- Create conftest.py with common fixtures:
python
"""Shared fixtures and mocks for testing."""
import pytest
from unittest.mock import Mock, MagicMock, patch
import numpy as np
import torch  # or appropriate framework


# === LLM Mock Fixtures ===

@pytest.fixture
def mock_llm_response():
    """Mock LLM response for testing."""
    return {
        "choices": [{
            "message": {
                "content": "Mocked LLM response for testing purposes.",
                "role": "assistant"
            }
        }],
        "usage": {"prompt_tokens": 10, "completion_tokens": 20}
    }


@pytest.fixture
def mock_openai_client(mock_llm_response):
    """Mock OpenAI client."""
    mock_client = MagicMock()
    mock_client.chat.completions.create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="Mocked response"))]
    )
    return mock_client


@pytest.fixture
def mock_anthropic_client():
    """Mock Anthropic client."""
    mock_client = MagicMock()
    mock_client.messages.create.return_value = MagicMock(
        content=[MagicMock(text="Mocked response")]
    )
    return mock_client


# === Data Mock Fixtures ===

@pytest.fixture
def sample_embedding():
    """Sample embedding vector for testing."""
    return np.random.randn(768).astype(np.float32)


@pytest.fixture
def sample_batch_embeddings():
    """Batch of sample embeddings."""
    return np.random.randn(8, 768).astype(np.float32)


@pytest.fixture
def sample_text_data():
    """Sample text data for testing."""
    return [
        "This is a test sentence for embedding.",
        "Another example text for testing purposes.",
        "Machine learning models need training data.",
    ]


@pytest.fixture
def mock_pdf_content():
    """Mock PDF content."""
    return """
    Abstract: This paper presents a novel approach...

    1. Introduction
    The problem of X is challenging because...

    2. Method
    We propose Algorithm 1 which...
    """


# === Network/API Mock Fixtures ===

@pytest.fixture
def mock_http_response():
    """Mock HTTP response."""
    mock_response = MagicMock()
    mock_response.status_code = 200
    mock_response.json.return_value = {"data": "mocked"}
    mock_response.text = "Mocked response text"
    return mock_response


# === Tensor Fixtures (for PyTorch-based implementations) ===

@pytest.fixture
def sample_tensor():
    """Sample tensor for testing."""
    return torch.randn(2, 10, 64)  # (batch, seq_len, dim)


@pytest.fixture
def device():
    """Get appropriate device."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
Phase 4: Write Unit Tests
Write tests following these patterns:
4.1 Testing Functions with LLM Calls
python
"""Tests for modules that call LLMs."""
import json

import pytest
from unittest.mock import patch, MagicMock


class TestLLMModule:
    """Tests for LLM-dependent functionality."""

    def test_llm_call_with_mock(self, mock_openai_client):
        """Test function that calls LLM with mocked response."""
        with patch("src.module.openai_client", mock_openai_client):
            from src.module import process_with_llm
            result = process_with_llm("test input")

            # Verify LLM was called correctly
            mock_openai_client.chat.completions.create.assert_called_once()
            assert result is not None

    def test_llm_error_handling(self):
        """Test graceful handling of LLM errors."""
        with patch("src.module.openai_client") as mock_client:
            mock_client.chat.completions.create.side_effect = Exception("API Error")

            from src.module import process_with_llm
            with pytest.raises(Exception):
                process_with_llm("test input")

    @pytest.fixture
    def mock_structured_response(self):
        """Mock for structured LLM output."""
        return {
            "analysis": "mocked analysis",
            "confidence": 0.95,
            "recommendations": ["rec1", "rec2"]
        }

    def test_structured_llm_output(self, mock_structured_response):
        """Test parsing of structured LLM output."""
        with patch("src.module.call_llm") as mock_call:
            mock_call.return_value = json.dumps(mock_structured_response)

            from src.module import get_structured_analysis
            result = get_structured_analysis("input")

            assert "analysis" in result
            assert result["confidence"] == 0.95
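The patch targets above (`src.module.openai_client`, `src.module.call_llm`) name the module under test rather than the SDK, because `patch` replaces a name in the namespace where it is looked up. The self-contained sketch below shows the difference using two throwaway modules; `fake_sdk` and `fake_app` are invented stand-ins, not real packages.

```python
import sys
import types
from unittest.mock import MagicMock, patch

# Build two throwaway modules so the example runs without any real SDK.
# "fake_sdk" stands in for a vendor package; "fake_app" for src/module.py.
fake_sdk = types.ModuleType("fake_sdk")
fake_sdk.Client = lambda: "real client"
sys.modules["fake_sdk"] = fake_sdk

fake_app = types.ModuleType("fake_app")
exec(
    "from fake_sdk import Client\n"
    "def make_client():\n"
    "    return Client()\n",
    fake_app.__dict__,
)
sys.modules["fake_app"] = fake_app

# Patching the defining module misses the name already bound in fake_app.
with patch("fake_sdk.Client", MagicMock(return_value="mock")):
    result_patch_definition = fake_app.make_client()

# Patching the name where it is looked up takes effect.
with patch("fake_app.Client", MagicMock(return_value="mock")):
    result_patch_usage = fake_app.make_client()
```

Here `result_patch_definition` is still `"real client"`, while `result_patch_usage` is `"mock"`. When a mocked LLM test unexpectedly hits the network, a wrong patch target is a likely cause.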
4.2 Testing Data Processing with Mocked Data
python
"""Tests for data processing functions."""
import pytest
from unittest.mock import Mock, patch, mock_open
import numpy as np


class TestDataProcessing:
    """Tests for data loading and processing."""

    def test_load_embeddings(self, sample_batch_embeddings):
        """Test embedding loading with mocked data."""
        with patch("numpy.load") as mock_load:
            mock_load.return_value = sample_batch_embeddings

            from src.data_loader import load_embeddings
            result = load_embeddings("fake_path.npy")

            assert result.shape == (8, 768)

    def test_process_pdf(self, mock_pdf_content):
        """Test PDF processing with mocked content."""
        with patch("src.pdf_parser.extract_text") as mock_extract:
            mock_extract.return_value = mock_pdf_content

            from src.pdf_parser import process_paper
            result = process_paper("fake_paper.pdf")

            assert "Abstract" in result

    def test_file_not_found(self):
        """Test handling of missing files."""
        with patch("builtins.open", side_effect=FileNotFoundError):
            from src.data_loader import load_config
            with pytest.raises(FileNotFoundError):
                load_config("nonexistent.yaml")
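The success path of a file loader can be tested the same way with `mock_open`, which the import line above already pulls in. This sketch uses a hypothetical JSON-based `load_config` (the real loader in `src.data_loader` is not shown in this document and may parse YAML instead) to keep the example dependency-free.

```python
import json
from unittest.mock import mock_open, patch


def load_config(path):
    """Hypothetical loader: read a JSON config file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# mock_open builds a fake open() whose file handle returns read_data.
fake_file = mock_open(read_data='{"hidden_dim": 64}')
with patch("builtins.open", fake_file):
    config = load_config("config.json")

# The fake records how it was opened, so the path can be verified too.
fake_file.assert_called_once_with("config.json", encoding="utf-8")
```

No file named `config.json` needs to exist; the parsed `config` comes entirely from `read_data`.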
4.3 Testing Model Components
python
"""Tests for model/algorithm components."""
import pytest
import torch
import numpy as np


class TestModel:
    """Tests for core model functionality."""

    @pytest.fixture
    def model_config(self):
        """Create test configuration."""
        from src.config import Config
        return Config(
            hidden_dim=64,
            num_layers=2,
            dropout=0.0,  # Disable dropout for deterministic tests
        )

    @pytest.fixture
    def model(self, model_config):
        """Create model instance."""
        from src.model import Model
        return Model(model_config)

    def test_forward_pass_shape(self, model, sample_tensor):
        """Test output shape matches expected."""
        output = model(sample_tensor)
        assert output.shape == sample_tensor.shape

    def test_gradient_flow(self, model, sample_tensor):
        """Test gradients flow through model."""
        sample_tensor.requires_grad_(True)
        output = model(sample_tensor)
        loss = output.sum()
        loss.backward()

        assert sample_tensor.grad is not None
        assert not torch.isnan(sample_tensor.grad).any()

    def test_deterministic_output(self, model, sample_tensor):
        """Test model produces consistent output."""
        torch.manual_seed(42)
        output1 = model(sample_tensor)

        torch.manual_seed(42)
        output2 = model(sample_tensor)

        assert torch.allclose(output1, output2)

    def test_batch_independence(self, model):
        """Test batch elements are processed independently."""
        x = torch.randn(4, 10, 64)

        # Process full batch
        full_output = model(x)

        # Process individual samples
        individual_outputs = torch.stack([model(x[i:i+1]) for i in range(4)])

        assert torch.allclose(full_output, individual_outputs.squeeze(1), atol=1e-5)

    def test_numerical_stability(self, model):
        """Test model handles edge cases."""
        # Very small values
        small_input = torch.randn(2, 10, 64) * 1e-6
        output_small = model(small_input)
        assert not torch.isnan(output_small).any()
        assert not torch.isinf(output_small).any()

        # Very large values
        large_input = torch.randn(2, 10, 64) * 1e3
        output_large = model(large_input)
        assert not torch.isnan(output_large).any()
4.4 Testing with Paper-Specific Examples
python
"""Tests using examples from the paper."""
import pytest
import numpy as np


class TestPaperExamples:
    """Tests derived from examples in the paper."""

    def test_equation_3_implementation(self):
        """
        Test implementation of Eq. (3) from Section 3.2.

        According to the paper:
            y = softmax(QK^T / sqrt(d_k)) @ V

        With Q = K = V = I (identity), output should equal softmax(I/sqrt(d)) @ I
        """
        from src.model import attention

        d_k = 64
        identity = np.eye(d_k, dtype=np.float32)

        result = attention(
            query=identity,
            key=identity,
            value=identity,
        )

        # Expected: softmax of identity scaled by sqrt(d_k)
        expected_attn = np.exp(identity / np.sqrt(d_k))
        expected_attn = expected_attn / expected_attn.sum(axis=-1, keepdims=True)
        expected = expected_attn @ identity

        np.testing.assert_allclose(result, expected, rtol=1e-5)

    def test_algorithm_1_step_by_step(self):
        """
        Test Algorithm 1 as described in Section 4.

        This verifies each step of the algorithm matches paper description.
        """
        from src.algorithm import Algorithm1

        algo = Algorithm1()

        # Step 1: Initialize (paper says initialize to zeros)
        state = algo.initialize(dim=10)
        assert np.allclose(state, np.zeros(10))

        # Step 2: Update rule (paper Eq. 5)
        input_data = np.ones(10)
        updated = algo.update(state, input_data)

        # According to paper: new_state = 0.9 * state + 0.1 * input
        expected = 0.9 * state + 0.1 * input_data
        np.testing.assert_allclose(updated, expected)
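For the identity-matrix attention example, the expected values can also be derived by hand, which gives an independent check on the NumPy expression. With Q = K = V = I, each row of the scaled scores has one entry equal to 1/sqrt(d_k) and d_k - 1 entries equal to 0, so the softmax has only two distinct values. The helper name below is hypothetical and the sketch uses only the standard library.

```python
import math


def expected_attention_identity(d_k: int) -> tuple[float, float]:
    """Return (diagonal, off_diagonal) entries of softmax(I / sqrt(d_k)) @ I."""
    diag_logit = 1.0 / math.sqrt(d_k)  # scaled score on the diagonal
    # Off-diagonal logits are 0, and exp(0) = 1 for each of the d_k - 1 entries.
    denom = math.exp(diag_logit) + (d_k - 1)
    return math.exp(diag_logit) / denom, 1.0 / denom


diag, off = expected_attention_identity(64)
```

Because V is the identity, the output matrix equals the attention weights themselves: `diag` on the diagonal and `off` everywhere else, with each row summing to 1.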
Phase 5: Create Parametrized Tests
python
"""Parametrized tests for comprehensive coverage."""
import pytest
import numpy as np
import torch

# Note: these tests request the `model` fixture. In section 4.3 it is defined
# inside TestModel, where only that class can see it; move `model` and
# `model_config` to conftest.py so this class can use them.


class TestParametrized:
    """Parametrized tests for various input scenarios."""

    @pytest.mark.parametrize("batch_size", [1, 4, 16, 32])
    def test_various_batch_sizes(self, model, batch_size):
        """Test model with different batch sizes."""
        x = torch.randn(batch_size, 10, 64)
        output = model(x)
        assert output.shape[0] == batch_size

    @pytest.mark.parametrize("seq_len", [1, 10, 100, 512])
    def test_various_sequence_lengths(self, model, seq_len):
        """Test model with different sequence lengths."""
        x = torch.randn(2, seq_len, 64)
        output = model(x)
        assert output.shape[1] == seq_len

    @pytest.mark.parametrize("input_type,expected_error", [
        (None, TypeError),
        ("string", TypeError),
        ([], ValueError),
    ])
    def test_invalid_inputs(self, model, input_type, expected_error):
        """Test model rejects invalid inputs."""
        with pytest.raises(expected_error):
            model(input_type)
Phase 6: Add Integration Tests (if --integration)
python
"""Integration tests for end-to-end workflows."""
import pytest
from unittest.mock import patch, MagicMock


class TestIntegration:
    """End-to-end integration tests."""

    @pytest.fixture
    def mock_external_services(self, mock_openai_client, mock_http_response):
        """Mock all external services."""
        with patch("src.llm.client", mock_openai_client), \
             patch("requests.get", return_value=mock_http_response), \
             patch("requests.post", return_value=mock_http_response):
            yield

    def test_full_pipeline(self, mock_external_services, sample_text_data):
        """Test complete processing pipeline."""
        from src.pipeline import Pipeline

        pipeline = Pipeline()

        # Run full pipeline with mocked externals
        result = pipeline.process(sample_text_data)

        assert result is not None
        assert "output" in result

    def test_error_recovery(self, mock_openai_client):
        """Test pipeline recovers from transient errors."""
        # First call fails, second succeeds
        mock_openai_client.chat.completions.create.side_effect = [
            Exception("Transient error"),
            MagicMock(choices=[MagicMock(message=MagicMock(content="Success"))])
        ]

        with patch("src.llm.client", mock_openai_client):
            from src.pipeline import Pipeline
            pipeline = Pipeline(retry_count=2)

            result = pipeline.process_with_retry("input")
            assert result == "Success"
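The error-recovery test relies on a `side_effect` list: each call consumes the next item, raising it if it is an exception and returning it otherwise. The sketch below isolates that behavior with a hypothetical `call_with_retry` helper; it is not the `Pipeline.process_with_retry` method from the implementation, whose retry logic is not shown here.

```python
from unittest.mock import MagicMock


def call_with_retry(func, retry_count):
    """Hypothetical retry loop: try func up to retry_count times."""
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    last_error = None
    for _ in range(retry_count):
        try:
            return func()
        except Exception as exc:  # broad on purpose for this illustration
            last_error = exc
    raise last_error


# First call raises, second call returns a value.
flaky = MagicMock(side_effect=[Exception("Transient error"), "Success"])
result = call_with_retry(flaky, retry_count=2)
```

Checking `flaky.call_count` afterwards confirms the retry actually happened rather than the first call silently succeeding.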
Phase 7: Run and Verify Tests
- Run all tests:
bash
cd /Users/shibuiyusuke/tmp/paper2code
uv run pytest paper_impl/{arxiv_id}/tests/ -v
- Check coverage:
bash
uv run pytest paper_impl/{arxiv_id}/tests/ --cov=paper_impl/{arxiv_id}/src --cov-report=term-missing
- Fix any failing tests and ensure coverage meets target
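Comparing the measured coverage with the `--coverage` target can be scripted. The sketch below assumes a JSON coverage report in coverage.py's layout, where the overall percentage is stored under `totals.percent_covered`; the function name `meets_target` is hypothetical, and the report would need to be generated separately (for example with a JSON report option) since the command above only prints to the terminal.

```python
import json


def meets_target(report_text: str, target: float = 80.0) -> bool:
    """Check a coverage JSON report against the target percentage."""
    # Assumption: coverage.py-style layout with totals.percent_covered.
    totals = json.loads(report_text)["totals"]
    return totals["percent_covered"] >= target


sample_report = json.dumps({"totals": {"percent_covered": 85.0}})
ok_default = meets_target(sample_report)            # default target of 80
ok_strict = meets_target(sample_report, target=90)  # stricter target
```

With the sample report, the default target is met and the stricter one is not. pytest-cov's `--cov-fail-under` option offers a similar check directly on the command line.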
Mock Patterns Reference
Mocking LLM Clients
python
from unittest.mock import MagicMock, patch

# Note: patch the name where the code under test looks it up. If a module does
# `from openai import OpenAI`, patch that module's name (e.g. "src.module.OpenAI")
# instead of "openai.OpenAI".

# OpenAI
@patch("openai.OpenAI")
def test_openai(mock_openai):
    mock_client = MagicMock()
    mock_openai.return_value = mock_client
    mock_client.chat.completions.create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="response"))]
    )

# Anthropic
@patch("anthropic.Anthropic")
def test_anthropic(mock_anthropic):
    mock_client = MagicMock()
    mock_anthropic.return_value = mock_client
    mock_client.messages.create.return_value = MagicMock(
        content=[MagicMock(text="response")]
    )

# LangChain
@patch("langchain.llms.OpenAI")
def test_langchain(mock_llm):
    mock_llm.return_value.return_value = "mocked response"
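Beyond returning a canned response, a mocked client records how it was called, so a test can check the prompt and parameters that were sent. A minimal sketch, using a hypothetical `summarize` function and a placeholder model name in place of real implementation code:

```python
from unittest.mock import MagicMock


def summarize(client, text):
    """Hypothetical function under test that calls an OpenAI-style client."""
    response = client.chat.completions.create(
        model="example-model",  # placeholder model name
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content


client = MagicMock()
client.chat.completions.create.return_value = MagicMock(
    choices=[MagicMock(message=MagicMock(content="short summary"))]
)

summary = summarize(client, "a long paragraph")

# call_args holds the (args, kwargs) of the most recent call.
_, kwargs = client.chat.completions.create.call_args
```

Asserting on `kwargs["messages"]` catches prompt-construction bugs that a plain `assert_called_once()` would miss.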
Mocking Data Sources
python
from unittest.mock import mock_open, patch

import numpy as np
import pandas as pd

# File reading
@patch("builtins.open", mock_open(read_data="mocked file content"))
def test_file_read():
    ...

# NumPy load
@patch("numpy.load")
def test_numpy_load(mock_load):
    mock_load.return_value = np.array([[1, 2], [3, 4]])

# Pandas read_csv
@patch("pandas.read_csv")
def test_pandas(mock_read):
    mock_read.return_value = pd.DataFrame({"col": [1, 2, 3]})

# HTTP requests
@patch("requests.get")
def test_http(mock_get):
    mock_get.return_value.json.return_value = {"key": "value"}
    mock_get.return_value.status_code = 200
Mocking Embedding Models
python
from unittest.mock import MagicMock, patch

import numpy as np

# Sentence Transformers
@patch("sentence_transformers.SentenceTransformer")
def test_embeddings(mock_st):
    mock_model = MagicMock()
    mock_st.return_value = mock_model
    mock_model.encode.return_value = np.random.randn(3, 384)

# OpenAI Embeddings
@patch("openai.OpenAI")
def test_openai_embeddings(mock_openai):
    mock_client = MagicMock()
    mock_openai.return_value = mock_client
    mock_client.embeddings.create.return_value = MagicMock(
        data=[MagicMock(embedding=[0.1] * 1536)]
    )
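A fixed `return_value` always yields the same number of vectors regardless of how many texts are passed in, which can hide shape bugs. One alternative is a `side_effect` function that produces one deterministic vector per input. The sketch below uses plain lists instead of NumPy arrays to stay dependency-free, and `fake_encode` is a made-up stand-in rather than real embedding behavior.

```python
from unittest.mock import MagicMock

EMBED_DIM = 384  # same dimension as the Sentence Transformers mock above


def fake_encode(texts):
    """Return one deterministic vector per input text (plain lists, no NumPy)."""
    # The value is derived from the text length only so results are repeatable.
    return [[float(len(t) % 7)] * EMBED_DIM for t in texts]


mock_model = MagicMock()
mock_model.encode.side_effect = fake_encode

vectors = mock_model.encode(["a", "bb", "ccc", "dddd"])
```

The output length now follows the input length, and repeated runs give identical values, which fits the guideline on deterministic tests.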
Output
After running this skill, the implementation will have:
- tests/conftest.py: Shared fixtures for mocking LLMs, data, and external services
- tests/test_*.py: Comprehensive test files for each module
- Coverage report: Showing test coverage percentage
Important Guidelines
- Always mock external dependencies: Never make real API calls in tests
- Use fixtures for reusable mocks: Define common mocks in conftest.py
- Test edge cases: Include tests for empty inputs, large inputs, and error conditions
- Reference the paper: Add docstrings explaining which paper section/equation is being tested
- Keep tests fast: Use small tensors and minimal iterations
- Test deterministically: Set random seeds where needed
Example Session
code
User: /add-tests 2506_08098v1

Claude: I'll add comprehensive pytest tests to the Cognitive Weave implementation.

[Phase 1: Locating implementation...]
Found: paper_impl/2506_08098v1/
Reading paper PDF to understand algorithm...

[Phase 2: Analyzing code...]
Found modules: data_structures.py, vectorial_resonator.py, strg.py, nexus_weaver.py
Identified LLM calls in: semantic_oracle.py
Identified embedding calls in: vectorial_resonator.py

[Phase 3: Creating test infrastructure...]
Created: tests/conftest.py with fixtures for mocking

[Phase 4: Writing unit tests...]
Created tests for all modules with mocked dependencies

[Phase 5: Running tests...]
All 47 tests passed
Coverage: 85%

Tests added successfully to paper_impl/2506_08098v1/tests/