Web Scraping Skill

Description

Extracts data from live web pages using JavaScript execution and CSS selectors. The skill drives a real browser: it navigates to the target URL and runs JavaScript in the page, reading the page title from document.title and the body text from document.body.textContent.

Dependencies

None

Parameters

Parameter	Type	Required	Default	Description
url	String	Yes	-	The URL of the web page to scrape
selector	String	Yes	-	CSS selector to target specific elements
attributes	List<String>	No	["text"]	List of attributes to extract (e.g., "text", "href", "src")
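
The default for attributes can be illustrated with a small sketch. The helper name resolveAttributes is hypothetical and not part of the skill's API; treating an empty list the same as a missing one is also an assumption.

```kotlin
// Hypothetical helper: returns the requested attributes, or the documented
// default ["text"] when the "attributes" parameter is absent.
fun resolveAttributes(params: Map<String, Any?>): List<String> {
    // Keep only string entries; anything else in the list is ignored.
    val requested = (params["attributes"] as? List<*>)?.filterIsInstance<String>()
    // Assumption: an empty list falls back to the default as well.
    return if (requested.isNullOrEmpty()) listOf("text") else requested
}
```

Called with only url and selector, this sketch yields ["text"]; called with "attributes" to listOf("text", "href"), it returns that list unchanged.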

Return Value

Returns a SkillResult with the following data structure:

json

{
  "url": "string",
  "selector": "string",
  "attributes": ["string"],
  "data": "extracted content"
}

Usage Examples

Basic Text Extraction

kotlin

val result = registry.execute(
    skillId = "web-scraping",
    context = context,
    params = mapOf(
        "url" to "https://example.com",
        "selector" to ".content"
    )
)

Extract Multiple Attributes

kotlin

val result = registry.execute(
    skillId = "web-scraping",
    context = context,
    params = mapOf(
        "url" to "https://example.com/products",
        "selector" to "a.product-link",
        "attributes" to listOf("text", "href")
    )
)

Tool Call Specification

kotlin

ToolSpec(
    domain = "skill.debug.scraping",
    method = "extract",
    // Argument signatures, one string per parameter
    arguments = listOf(
        "url: String",
        "selector: String",
        "attributes: List<String> = listOf(\"text\")"
    ),
    returnType = "Map<String, Any>",
    description = "Extract data from a web page using CSS selectors"
)

Error Handling

The skill returns a failure result in the following cases:

• Missing required parameter url
• Missing required parameter selector
• Invalid URL format (must start with http:// or https://)
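
The three failure cases above could be checked roughly as follows. This is a minimal sketch, not the skill's actual code: the function name findParameterError and the message strings are hypothetical, and treating a blank value as missing is an assumption.

```kotlin
// Hypothetical helper: returns an error message for the first failed check,
// or null when the parameters pass all three checks.
fun findParameterError(params: Map<String, Any?>): String? {
    val url = params["url"] as? String
    // Assumption: a blank value counts as missing.
    if (url.isNullOrBlank()) return "Missing required parameter: url"

    val selector = params["selector"] as? String
    if (selector.isNullOrBlank()) return "Missing required parameter: selector"

    // Only http:// and https:// URLs are accepted.
    if (!url.startsWith("http://") && !url.startsWith("https://")) {
        return "Invalid URL format (must start with http:// or https://)"
    }
    return null
}
```

A caller would turn a non-null message into a failure result and proceed with scraping only when the function returns null.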

Lifecycle Hooks

onLoad

Initializes resources and loads configurations when the skill is registered.

onBeforeExecute

Validates the URL format before execution. Returns false if the URL does not start with http:// or https://.

onAfterExecute

Records the timestamp of successful scraping operations in shared resources.
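
As a rough illustration, the hook body might look like the sketch below. The real type of the shared resources is not shown in this document, so a ConcurrentHashMap stands in for it; the key "lastScrapeTimestamp" and the function name are hypothetical.

```kotlin
import java.time.Instant
import java.util.concurrent.ConcurrentHashMap

// Stand-in for the skill context's shared resources (assumed to be map-like).
val sharedResources = ConcurrentHashMap<String, Any>()

// Hypothetical hook body: store the completion time only when the run succeeded.
// The clock value is a parameter so the behavior is easy to test.
fun recordScrapeTimestamp(success: Boolean, now: Instant = Instant.now()) {
    if (success) {
        sharedResources["lastScrapeTimestamp"] = now
    }
}
```

A failed run leaves the stored timestamp untouched, so the value always reflects the most recent successful scrape.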

validate

Validates skill configuration and environment. Always returns true for this skill.

Implementation Notes

• Uses JavaScript execution via WebDriver.evaluate() to extract real webpage content
• Extracts document title using document.title
• Extracts page text using document.body.textContent
• Falls back to PulsarSession-based extraction if WebDriver is not directly available
• Returns simulated data only if neither WebDriver nor PulsarSession is available in the skill context
• Supports both synchronous and asynchronous execution patterns
• Thread-safe and can be used in concurrent environments
• Text content is limited to 5000 characters to avoid overwhelming responses
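
Two of the notes above, the fallback order and the 5000-character limit, can be sketched in a few lines. All names here (ExtractionBackend, chooseBackend, limitText, MAX_TEXT_LENGTH) are hypothetical, and cutting the text off at the limit without an ellipsis or marker is an assumption.

```kotlin
// Limit taken from the implementation notes; the constant name is hypothetical.
const val MAX_TEXT_LENGTH = 5000

// Hypothetical labels for the three extraction paths described in the notes.
enum class ExtractionBackend { WEB_DRIVER, PULSAR_SESSION, SIMULATED }

// Picks the first available backend in the documented order:
// WebDriver, then PulsarSession, then simulated data as a last resort.
fun chooseBackend(hasWebDriver: Boolean, hasPulsarSession: Boolean): ExtractionBackend = when {
    hasWebDriver -> ExtractionBackend.WEB_DRIVER
    hasPulsarSession -> ExtractionBackend.PULSAR_SESSION
    else -> ExtractionBackend.SIMULATED
}

// Caps extracted text at maxLength characters; shorter text is returned unchanged.
fun limitText(text: String, maxLength: Int = MAX_TEXT_LENGTH): String =
    text.take(maxLength)
```

With this sketch, a 6000-character body is cut to its first 5000 characters, and simulated data is chosen only when both availability flags are false.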
