LLM Rankings Skill

Comprehensive evaluation and ranking system for comparing language models across performance, cost, and technical dimensions.

Core Capabilities

This skill provides four main ranking methodologies:

•Benchmark-Based Rankings - Objective comparisons using MMLU, GSM8K, and HumanEval scores
•Task-Specific Rankings - Weighted recommendations for code generation, creative writing, reasoning, etc.
•Cost-Effectiveness Rankings - Performance per dollar analysis
•Real-World Performance - API reliability, documentation quality, ease of integration
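
The task-specific and cost-effectiveness rankings above can be illustrated with a weighted benchmark score and a score-per-dollar ratio. This is only a minimal sketch under stated assumptions, not the skill's actual implementation: the model names, scores, prices, and weights are placeholders rather than real data, and `ModelEntry`, `weighted_score`, and `score_per_dollar` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelEntry:
    name: str
    scores: dict[str, float]        # benchmark name -> score on a 0-100 scale
    usd_per_million_tokens: float   # blended price (placeholder values below)


def weighted_score(model: ModelEntry, weights: dict[str, float]) -> float:
    """Weighted average of the benchmarks named in `weights`."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(model.scores[b] * w for b, w in weights.items()) / total_weight


def score_per_dollar(model: ModelEntry, weights: dict[str, float]) -> float:
    """Weighted score divided by price: a simple performance-per-dollar ratio."""
    return weighted_score(model, weights) / model.usd_per_million_tokens


# Placeholder entries: NOT real benchmark results or prices.
models = [
    ModelEntry("model-a", {"MMLU": 80.0, "GSM8K": 90.0, "HumanEval": 70.0}, 10.0),
    ModelEntry("model-b", {"MMLU": 70.0, "GSM8K": 80.0, "HumanEval": 60.0}, 2.0),
]

# Assumed weighting for a code-generation use case (HumanEval counts triple).
coding_weights = {"HumanEval": 3, "GSM8K": 1, "MMLU": 1}

by_quality = sorted(models, key=lambda m: weighted_score(m, coding_weights), reverse=True)
by_value = sorted(models, key=lambda m: score_per_dollar(m, coding_weights), reverse=True)
```

With these placeholder numbers the two orderings disagree: the stronger model leads on quality, the cheaper one on value. That is why the two rankings are worth reporting separately.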

When the user asks "Which LLM is better for X?":

When the user asks for a comprehensive comparison:

When the user describes a specific use case:

Load these files as needed to inform recommendations:

•benchmarks.md - Comprehensive benchmark scores (MMLU, GSM8K, HumanEval, MMMU, etc.)
•model-details.md - Technical specifications, context windows, API details, capabilities
•use-cases.md - Task-specific recommendations organised by common use cases
•pricing.md - Current pricing across all providers, cost optimisation strategies

Present concise recommendations with model name, key strength, pricing snapshot, and one-sentence justification.

Use markdown tables comparing models across relevant dimensions (performance, context window, pricing, best use).
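
One way to produce such a comparison table is sketched below. The helper name `to_markdown_table`, the column names, and the row values are assumptions made for illustration only; missing cells are rendered as "n/a" rather than guessed.

```python
def to_markdown_table(rows: list[dict[str, str]], columns: list[str]) -> str:
    """Render rows as a markdown table, one row per model."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "n/a")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Placeholder rows: values are illustrative, not real specifications.
columns = ["Model", "Context window", "Pricing", "Best use"]
rows = [
    {"Model": "model-a", "Context window": "128k", "Best use": "code generation"},
    {"Model": "model-b", "Context window": "32k", "Pricing": "low"},
]
table = to_markdown_table(rows, columns)
```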

Structure as:

•Evidence-Based - Support all rankings with benchmark data or documented performance
•Context-Aware - Consider user's specific requirements, budget, technical environment
•Transparent - Explain weighting decisions and ranking criteria clearly
•Current Information - Use web_search to verify latest releases, pricing changes, benchmark updates
•Practical Focus - Prioritise real-world usage factors over pure benchmark scores
•Balanced - Present strengths and weaknesses honestly for each model

•Benchmark Limitations - Benchmarks don't perfectly reflect real-world performance
•Task Specificity - A model's ranking varies significantly by use case
•Pricing Volatility - API pricing changes frequently; verify for important decisions
•Access Availability - Some models have waitlists or geographic restrictions
•Trade-offs - Filling a larger context window usually means slower processing

•Always verify current pricing and availability via web search, since both change often
•Consider user's deployment environment (API vs self-hosted)
•Account for additional costs (vision inputs, fine-tuning, enterprise features)
•Recommend testing on user's specific use case before committing
•Highlight when free tiers or trials are available
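
The point about additional costs can be made concrete with a rough monthly estimate. This is a sketch under assumptions: it supposes per-million-token pricing with separate input and output rates, lumps every other charge into one flat `extra_usd` figure, and uses placeholder prices. The function name `estimate_monthly_cost` is hypothetical.

```python
def estimate_monthly_cost(
    requests_per_month: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    extra_usd: float = 0.0,  # flat add-ons, e.g. fine-tuning or enterprise fees
) -> float:
    """Rough monthly spend: token charges plus a flat allowance for extras."""
    input_cost = requests_per_month * input_tokens_per_request * usd_per_million_input
    output_cost = requests_per_month * output_tokens_per_request * usd_per_million_output
    return (input_cost + output_cost) / 1_000_000 + extra_usd


# Placeholder workload and prices, for illustration only.
monthly = estimate_monthly_cost(
    requests_per_month=100_000,
    input_tokens_per_request=1_000,
    output_tokens_per_request=500,
    usd_per_million_input=3.0,
    usd_per_million_output=15.0,
    extra_usd=50.0,
)
```

Running the same estimate for each candidate model, using the user's real traffic figures and currently verified prices, gives a like-for-like cost comparison before committing.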

Provides comprehensive coverage of: