Verify FireCrawl Crawled Content
This skill verifies that Markdown files scraped by FireCrawl accurately reflect the content of the original web pages.
When to Use
- User asks to verify crawled/scraped Markdown files
- User wants to check if crawled content matches original web pages
- User mentions checking for errors or missing content in scraped files
Workflow
- Read the Markdown file specified by the user
- Extract metadata from the YAML frontmatter:
  - `source_url`: The original URL that was scraped
  - `scraped_at`: When the content was scraped
- Fetch the original web page using the `WebFetch` tool with the `source_url`
- Compare content between the Markdown file and the freshly fetched content
- Generate verification report
File Format Expected
The Markdown files should have this structure:
```markdown
---
source_url: https://example.com/page
scraped_at: 2026-01-09
---

# Page Title

Content...
```
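As a rough illustration of the metadata extraction step, the frontmatter can be split from the body with a few lines of Python. This is a minimal sketch, not part of the skill itself: the function name `parse_frontmatter` is hypothetical, and it assumes flat `key: value` pairs between two `---` lines rather than full YAML.

```python
def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a scraped Markdown document into (metadata, body).

    Assumption: the frontmatter is a flat list of `key: value` lines
    delimited by `---`; nested YAML is not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all

    metadata: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return metadata, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URLs keep their "https:" part.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()

    # No closing delimiter: treat the whole text as body.
    return {}, text


sample = """---
source_url: https://example.com/page
scraped_at: 2026-01-09
---

# Page Title

Content..."""

meta, body = parse_frontmatter(sample)
print(meta["source_url"], meta["scraped_at"])
```

If the files may contain richer YAML (lists, quoted strings, multi-line values), a real YAML parser would be a safer choice than this line-based split.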
Comparison Criteria
When comparing, check for:
- Title Match: Does the main heading match?
- Key Sections Present: Are all major sections from the original present in the scraped file?
- Important Data Accuracy: Are dates, names, and numbers accurately captured?
- Link Integrity: Are important links preserved?
- Content Completeness: Is there significant missing content?
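Some of the criteria above (title, sections, links) lend themselves to a mechanical first pass. The sketch below is one possible approach, assuming both the scraped file and the freshly fetched page are available as Markdown text; the helper names `extract_headings`, `extract_links`, and `compare_content` are hypothetical, and data accuracy and overall completeness still need a careful read.

```python
import re

# ATX headings such as "## Section"; inline links such as "[text](url)".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def extract_headings(markdown: str) -> list[str]:
    """Return heading texts in document order, lowercased for comparison."""
    return [m.group(1).strip().lower() for m in HEADING_RE.finditer(markdown)]


def extract_links(markdown: str) -> set[str]:
    """Return the set of link targets found in inline Markdown links."""
    return set(LINK_RE.findall(markdown))


def compare_content(scraped: str, original: str) -> dict[str, object]:
    """Compare the first heading, section headings, and link targets."""
    s_heads = extract_headings(scraped)
    o_heads = extract_headings(original)
    return {
        "title_match": bool(s_heads and o_heads and s_heads[0] == o_heads[0]),
        "missing_sections": [h for h in o_heads if h not in s_heads],
        "missing_links": sorted(extract_links(original) - extract_links(scraped)),
    }


# Illustrative inputs only; the URLs are placeholders.
scraped = "# Bridge Program\n\n## Dates\n\nSee [apply](https://example.com/apply)."
original = (
    "# Bridge Program\n\n## Dates\n\n## Contact\n\n"
    "See [apply](https://example.com/apply) or [faq](https://example.com/faq)."
)
result = compare_content(scraped, original)
print(result)
```

A non-empty `missing_sections` or `missing_links` list is a prompt to look closer, not an automatic FAIL, since the original page may have changed since it was scraped.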
Output Format
Return the verification result in this format:
```
File: <file_path>
URL: <source_url>
Status: PASS | FAIL
Comment: <brief explanation of findings>
```
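For consistency across many files, the report can be produced by a small helper. This is a sketch only: the function `format_report` is hypothetical, and the URL and comment in the example are placeholder values, not results from a real verification.

```python
def format_report(file_path: str, source_url: str, passed: bool, comment: str) -> str:
    """Render the four-line verification report described above."""
    status = "PASS" if passed else "FAIL"
    return "\n".join([
        f"File: {file_path}",
        f"URL: {source_url}",
        f"Status: {status}",
        f"Comment: {comment}",
    ])


# Placeholder values for illustration.
report = format_report(
    "data/aaai-26/bridge-program/bridge-program.md",
    "https://example.com/page",
    passed=False,
    comment="One section from the original page is missing in the scraped file.",
)
print(report)
```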
Status Definitions
- PASS: The Markdown file accurately represents the original web page content with no significant discrepancies
- FAIL: There are notable differences, missing content, or errors between the Markdown and the original page
Example Usage
User: "Verify the crawled file at data/aaai-26/bridge-program/bridge-program.md"
Steps:
- Read `data/aaai-26/bridge-program/bridge-program.md`
- Extract `source_url` from frontmatter
- Fetch the original URL using the `WebFetch` tool
- Compare the two versions
- Output verification report
Important Notes
- If the original page has been updated since scraping, note this in the comment
- Focus on content accuracy, not formatting differences (minor whitespace/formatting differences are acceptable)
- If the page requires JavaScript rendering, mention this as a potential cause for discrepancies
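One way to keep formatting noise out of the comparison is to normalize both texts before comparing them. The sketch below is an assumption about what "minor formatting differences" covers (whitespace, letter case, and `*` emphasis markers); the function name `normalize` is hypothetical and the rules can be tightened or loosened as needed.

```python
import re


def normalize(text: str) -> str:
    """Reduce text to a form where only content differences remain.

    Assumption: asterisk emphasis, runs of whitespace, and letter case
    count as formatting, not content.
    """
    text = text.replace("*", "")      # drop bold/italic markers
    text = re.sub(r"\s+", " ", text)  # collapse spaces, tabs, newlines
    return text.strip().lower()


# Same content, different formatting: treated as equal.
same = normalize("**Deadline:**   March 1\n") == normalize("Deadline: March 1")

# Different content: still detected.
different = normalize("Deadline: March 1") != normalize("Deadline: March 2")
print(same, different)
```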