Extract Webpage Data
Extract structured data from any web page using AI. Turn messy HTML into clean, organized data.
When to Use
- User wants to extract specific data from a website
- User asks to scrape information from a page
- User needs structured data from unstructured content
- User wants to pull product info, contact details, etc.
- Converting web content to usable data
How It Works
Uses Olostep, Scrapegraph, or Riveter APIs for AI-powered data extraction.
Usage
Simple Scrape with Olostep
bash
orth run olostep /v1/scrapes -d '{"url_to_scrape":"https://example.com/products"}'
AI-Powered Extraction with Scrapegraph
bash
orth run scrapegraph /v1/smartscraper -d '{"website_url":"https://example.com/team","user_prompt":"Extract all team members with their names, titles, and LinkedIn URLs"}'
Schema-Based Extraction with Riveter
bash
orth run riveter /v1/scrape -d '{"url":"https://example.com","schema":{"name":"string","price":"number","description":"string"}}'
Get AI Answer from Web
bash
orth run olostep /v1/answers -d '{"task":"Find the pricing for Notion Teams plan from their website"}'
Crawl Multiple Pages
bash
orth run olostep /v1/crawls -d '{"start_url":"https://example.com","max_pages":10}'
Parameters
Olostep Scrape
- url_to_scrape (required) - URL to scrape
- formats - Output formats (markdown, html, text)
Scrapegraph
- website_url (required) - URL to scrape
- user_prompt (required) - Natural language description of what to extract
Riveter
- url (required) - URL to scrape
- schema - JSON schema defining the data structure to extract
Olostep Answer
- task (required) - Natural language task/question
Response
Olostep Response
Returns a scrape object:
- id (string) - Scrape ID (e.g., scrape_z926lxxon3)
- result.markdown_content (string|null) - Page content as markdown
- result.html_content (string|null) - Raw HTML (if requested via formats)
- result.text_content (string|null) - Plain text (if requested)
- result.markdown_hosted_url (string|null) - S3 URL for large content
- result.links_on_page (array) - Links found on the page
- result.screenshot_hosted_url (string|null) - Screenshot URL (if requested)
- result.page_metadata (object) - status_code of the page
- credits_consumed (integer) - Credits used for this scrape
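The fields above suggest a simple way to get at the page content: use result.markdown_content when it is present, and fall back to result.markdown_hosted_url for large pages. The following is a minimal sketch, assuming jq is installed and that orth prints the raw JSON response on stdout; print_markdown is a hypothetical helper name, not part of any of these APIs.

```bash
# print_markdown: reads an Olostep scrape object on stdin.
# Prints result.markdown_content when it is not null; otherwise prints a
# pointer to result.markdown_hosted_url; prints nothing if both are null.
# Assumption: jq is available on PATH.
print_markdown() {
  jq -r '
    if .result.markdown_content != null then
      .result.markdown_content
    elif .result.markdown_hosted_url != null then
      "markdown hosted at: " + .result.markdown_hosted_url
    else
      empty
    end'
}
```

Under those assumptions it could be used as `orth run olostep /v1/scrapes -d '{"url_to_scrape":"https://example.com/products"}' | print_markdown`.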
Async crawls: POST /v1/crawls returns an id. Poll with GET /v1/crawls/{id} until complete.
Scrapegraph Response
Returns structured extraction result:
- request_id (string) - Unique request identifier
- status (string) - completed or pending
- result (object) - AI-extracted data matching your prompt (dynamic keys)
- error (string) - Empty on success, error message on failure
Note: For large pages, the POST may return status: "pending". Poll with GET /v1/smartscraper/{request_id} until status is completed.
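The polling described in this note can be sketched as a small shell loop. The source does not show how to issue the GET request through orth, so the fetch command is passed in as arguments rather than guessed; poll_until_completed, the attempt limit, and the delay are hypothetical names and values chosen for illustration.

```bash
# poll_until_completed MAX_ATTEMPTS DELAY_SECONDS FETCH_COMMAND [ARGS...]
# Runs FETCH_COMMAND repeatedly until its output contains "status": "completed".
# Prints the final response and returns 0 on success; returns 1 if the fetch
# command fails or the attempts run out.
poll_until_completed() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  local response
  while [ "$attempt" -le "$max_attempts" ]; do
    response=$("$@") || return 1
    # Rough textual match on the status field; a JSON parser would be stricter.
    if printf '%s' "$response" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"completed"'; then
      printf '%s\n' "$response"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}
```

A call might look like `poll_until_completed 10 3 fetch_status`, where fetch_status is a function you define to perform GET /v1/smartscraper/{request_id} for your request.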
Riveter Response
Returns scrape result:
- request_status (string) - success or error
- message (string) - Human-readable status
- text (string) - Extracted page text content
- url (string) - URL that was scraped
- status_code (integer) - HTTP status of the page
- run_key (string) - Unique run identifier
- base_url_for_links (string) - Base URL for resolving relative links
- riveter_app_link (string) - Link to view run in Riveter dashboard
- credit_used (integer) - Credits consumed
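Since a Riveter response reports failure through request_status rather than only through the HTTP status, it can help to check that field before using text. This is a minimal sketch, assuming jq is installed and that the response JSON arrives on stdin; riveter_text is a hypothetical helper name.

```bash
# riveter_text: reads a Riveter response on stdin.
# On request_status "success", prints the text field and returns 0.
# Otherwise prints the message field to stderr and returns 1.
# Assumption: jq is available on PATH.
riveter_text() {
  local response
  response=$(cat)
  if [ "$(printf '%s' "$response" | jq -r '.request_status')" = "success" ]; then
    printf '%s' "$response" | jq -r '.text'
  else
    printf '%s' "$response" | jq -r '"Riveter error: " + (.message // "no message")' >&2
    return 1
  fi
}
```

Under those assumptions, `orth run riveter /v1/scrape -d '{"url":"https://example.com"}' | riveter_text` would print the page text or report the error.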
Examples
User: "Get all the product names and prices from this page"
bash
orth run scrapegraph /v1/smartscraper -d '{"website_url":"https://example.com/products","user_prompt":"Extract all products with name, price, and description"}'
User: "Scrape the team page and get everyone's info"
bash
orth run scrapegraph /v1/smartscraper -d '{"website_url":"https://example.com/about/team","user_prompt":"Extract team members: name, role, bio, photo URL, LinkedIn"}'
User: "What are Stripe's API pricing details?"
bash
orth run olostep /v1/answers -d '{"task":"Find Stripe API pricing breakdown from stripe.com/pricing"}'
User: "Get all blog post titles and dates from this blog"
bash
orth run riveter /v1/scrape -d '{"url":"https://blog.example.com","schema":{"posts":[{"title":"string","date":"string","url":"string"}]}}'
Error Handling
- 504 - Olostep timeout on slow pages; retry or try a simpler URL
- 400 - Missing required parameters (url_to_scrape for Olostep, website_url + user_prompt for Scrapegraph, url for Riveter)
- Scrapegraph returns an error field in the response body; check it even on a 200 status
- Riveter returns request_status: "error" with details in message
- Some sites block automated scraping; try a different API if one fails
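The retry and fallback advice above can be sketched in shell. This is an illustration under stated assumptions, not documented orth behavior: it assumes orth exits with a non-zero status when a request fails, and that the URL contains no double quotes or backslashes, since it is pasted into the JSON payload unescaped. The names scrape_olostep, scrape_riveter, retry, and try_in_order are hypothetical.

```bash
# Each wrapper issues one of the requests shown under Usage for the URL in $1.
scrape_olostep() {
  orth run olostep /v1/scrapes -d '{"url_to_scrape":"'"$1"'"}'
}
scrape_riveter() {
  orth run riveter /v1/scrape -d '{"url":"'"$1"'"}'
}

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Reruns COMMAND until it exits 0 or ATTEMPTS is reached (e.g. after a 504).
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# try_in_order URL SCRAPER [SCRAPER...]
# Calls each scraper function with URL and prints the output of the first
# one that exits 0. Returns 1 if every scraper fails.
try_in_order() {
  local url="$1"
  shift
  local scraper
  local output
  for scraper in "$@"; do
    if output=$("$scraper" "$url"); then
      printf '%s\n' "$output"
      return 0
    fi
    echo "$scraper failed for $url, trying next" >&2
  done
  return 1
}
```

For example, `retry 3 5 scrape_olostep "https://example.com"` retries a slow page, and `try_in_order "https://example.com" scrape_olostep scrape_riveter` falls back to Riveter if Olostep fails. Note that an exit status of 0 does not rule out an error reported inside the response body.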
Tips
- Scrapegraph is best for natural language extraction
- Riveter is best when you know the exact schema you want
- Olostep is great for general scraping and AI answers
- For dynamic sites (JavaScript-heavy), these tools handle rendering
- Be specific in your prompts for better extraction results
- Some sites may block automated access