Doc Site Analysis Skill
When analyzing an API documentation page to understand its structure and extraction patterns, follow this systematic approach.
Core Objective
Your goal is to discover how a documentation site organizes its API reference content so you can later write code to extract that content deterministically. You're not extracting data now—you're discovering the patterns that will inform scraper code generation.
Phase 1: Initial Reconnaissance
Fetch the target URL and observe the raw HTML structure. Look for:
- •Page framework signatures - Most doc sites use static site generators with distinctive patterns
- •Content organization - How is the page divided into sections?
- •Navigation structure - Sidebar, breadcrumbs, table of contents
- •Endpoint listing patterns - Tables, lists, or inline definitions
Document your observations before proceeding. The goal is to understand the "shape" of the documentation.
Phase 2: Identify the Index Pattern
API documentation typically provides an index of available endpoints. Common patterns:
Quick Reference Tables
Look for tables with columns like Method/Resource/Description or similar. These are gold—they give you a complete list of endpoints in structured form.
<table>
<tr><th>Type</th><th>Resource</th><th>Description</th></tr>
<tr>
<td>GET</td>
<td><a href="#get-connection">connections/:connection_id</a></td>
<td>Get connection details</td>
</tr>
</table>
Key signals:
- •Table has HTTP method column (GET, POST, PUT, DELETE, PATCH)
- •Links contain anchor fragments (
#heading-id) pointing to detail sections - •Resource column contains URL paths with parameters
Sidebar Navigation
Some sites list endpoints in the sidebar. Look for:
- •Nested lists under section headings
- •Links to anchors or separate pages
- •HTTP method badges or prefixes
Section Headings as Index
When no explicit index exists, the headings themselves form the index:
- •H2/H3 elements containing method + path patterns
- •Pattern: "GET /api/users" or "List Users (GET)"
Phase 3: Map Section Boundaries
Once you know where endpoints are listed, understand how detail sections are organized:
- •Heading-based sections - Each endpoint gets an H2/H3, content until next heading belongs to it
- •Container-based sections - Each endpoint wrapped in a div with class or ID
- •Flat structure - All content flows sequentially, headings are the only markers
For heading-based sections (most common), note:
- •The heading level used (H2, H3)
- •Whether headings have IDs (essential for anchor linking)
- •The pattern in heading text (method first? path first? description?)
Phase 4: Locate Content Elements
Within each endpoint section, identify where to find:
Parameters
Usually in tables with columns: Name, Type, Required, Description
<table> <tr><th>Name</th><th>Type</th><th>Required</th><th>Description</th></tr> <tr><td>id</td><td>integer</td><td>yes</td><td>User ID</td></tr> </table>
Look for section subheadings like "Request Parameters", "Query Parameters", "Body Parameters"
Request/Response Examples
Code blocks with language hints:
<pre><code class="language-json">{"name": "example"}</code></pre>
<pre><code class="language-curl">curl -X GET ...</code></pre>
Or syntax-highlighted divs:
<div class="highlight-json"><pre>...</pre></div>
Descriptions
Prose paragraphs between the heading and first table/code block. May contain important context about authentication, rate limits, or special behavior.
Phase 5: Detect Edge Cases
Real documentation has inconsistencies. Look for:
- •Broken anchor links - Index links that don't match heading IDs
- •Multiple table formats - Different pages may use different table structures
- •Nested content - Examples inside collapsible sections or tabs
- •Generated IDs - Headings without explicit IDs, where browser generates them from text
Document any patterns that differ from the main structure.
Output: Site Analysis Document
After analysis, produce a structured document containing:
site:
name: "Workato API Documentation"
base_url: "https://docs.workato.com"
framework: "VuePress" # or Docusaurus, ReadMe, custom, unknown
index_pattern:
type: "quick_reference_table" # or sidebar, headings, list
location: "top of page"
columns: ["Type", "Resource", "Description"]
link_column: "Resource"
anchor_format: "#heading-id"
section_pattern:
type: "heading_based"
heading_level: "h2"
id_source: "explicit" # or generated
text_format: "{description}" # e.g., "Get connection details"
content_elements:
parameters:
type: "table"
columns: ["Name", "Type", "Required", "Description"]
request_examples:
type: "code_block"
languages: ["curl", "json"]
container: "pre > code"
response_examples:
type: "code_block"
languages: ["json"]
container: "div.highlight-json pre"
edge_cases:
- "First endpoint in some pages has broken anchor link"
- "Some tables use 'Required?' instead of 'Required'"
This structured output becomes the input for scraper code generation.
Reference Materials
For detailed information on specific topics:
- •
references/html-patterns.md- Common HTML structures for tables, code blocks, navigation - •
references/framework-signatures.md- How to identify VuePress, Docusaurus, ReadMe, etc. - •
references/extraction-strategies.md- Parsing approaches for different patterns
For a worked example:
- •
examples/workato-analysis.md- Complete analysis of Workato API documentation