Anysite CLI

Command-line tool for web data extraction, dataset pipelines, and database operations.

Agent Planning Workflow

BEFORE planning any data collection task, follow this sequence:

•

Discover available endpoints

bash

anysite describe --search "<keyword>"   # Search by domain (linkedin, company, user, etc.)

•
Select endpoints needed for the task — identify which endpoints will provide the required data

•

Inspect each selected endpoint

bash

anysite describe /api/linkedin/company  # View input params and output fields

•
Only then plan — now you know the exact parameters, field names, and data structure to build your config or API calls

This prevents errors from wrong endpoint paths, missing required parameters, or incorrect field names in dependencies.

Best Practices

•
Use dataset pipelines for multi-step tasks
- •If a task requires sequential API calls, LLM enrichment, or chained data processing — create a dataset.yaml config instead of running multiple ad-hoc commands
- •Dataset pipelines handle dependencies, incremental collection, and error recovery automatically
- •Even for "simple" tasks that grow in scope, a dataset config is easier to maintain
- •Benefits: run history, incremental sync, scheduling, notifications, DB loading
•
Save data in Parquet format by default — unless user requests another format or CSV/JSON fits better
•
Prefer datasets over ad-hoc scripts — one dataset.yaml replaces dozens of shell commands

Quick Start Checklist

Before any data collection task:

bash

# 1. Check CLI is available
anysite --version
# If not found: source .venv/bin/activate or pip install anysite-cli

# 2. Update schema cache (required for endpoint discovery)
anysite schema update

# 3. Verify API key
anysite config get api_key
# If not set: anysite config set api_key sk-xxxxx

Endpoint Discovery

ALWAYS discover endpoints before writing API calls or dataset configs:

bash

anysite describe                          # List all endpoints
anysite describe --search "company"       # Search by keyword
anysite describe /api/linkedin/company    # Full details: input params + output fields

Prerequisites

bash

pip install "anysite-cli[data]"       # DuckDB + PyArrow for dataset commands
pip install "anysite-cli[llm]"        # LLM analysis (openai/anthropic)
pip install "anysite-cli[postgres]"   # PostgreSQL adapter

anysite config set api_key sk-xxxxx   # Configure API key
anysite schema update                  # Update schema cache
anysite llm setup                      # Configure LLM provider (interactive)
anysite db add pg --type postgres --host localhost --database mydb --user app --password-env PGPASS

Single API Call

bash

anysite api /api/linkedin/user user=satyanadella
anysite api /api/linkedin/company company=anthropic --format table
anysite api /api/linkedin/search/users keywords="CTO" count=50 --format csv --output ctos.csv
anysite api /api/linkedin/user user=satyanadella --fields "name,headline,urn.value" -q | jq

URN/Name Parameter Formats

Parameters like location, current_companies, industry accept two formats:

bash

# Single name (text search) — resolves to URNs automatically
location="London"
current_companies="Microsoft"

# Multiple URNs (direct) — use JSON array in single quotes
'location=["urn:li:geo:101165590", "urn:li:geo:101282230"]'
'current_companies=["urn:li:company:1035", "urn:li:company:1441"]'

Note: List of names ["Microsoft", "Google"] is NOT supported — use either one name OR multiple URNs.

Batch Processing

bash

anysite api /api/linkedin/user --from-file users.txt --input-key user \
  --parallel 5 --rate-limit "10/s" --on-error skip --progress --stats

Dataset Pipeline (Multi-Source Collection)

For complex data collection with dependencies, LLM enrichment, scheduling — use dataset pipelines.

Initialize

bash

anysite dataset init my-dataset
# Creates my-dataset/dataset.yaml with template config

Five Source Types

•Independent — single API call with static params
•from_file — batch calls iterating over input file values
•Dependent — batch calls using values extracted from a parent source
•Union (type: union) — combine records from multiple parent sources into one
•LLM (type: llm) — process parent data through LLM without API calls

Comprehensive Dataset YAML Reference

yaml

name: my-dataset                    # Dataset name (required)
description: Optional description   # Human-readable description

sources:
  # === TYPE 1: Independent source (single API call) ===
  - id: search_results              # Unique identifier (required)
    endpoint: /api/linkedin/search/users  # API endpoint (required for type: api)
    params:                         # Static API parameters
      keywords: "software engineer"
      count: 50
    parallel: 1                     # Concurrent requests: 1-10 (default: 1)
    rate_limit: "10/s"              # Rate limit: "N/s", "N/m", "N/h"
    on_error: stop                  # Error handling: stop | skip (default: stop)

  - id: search_extra                # Another search (can be combined with union)
    endpoint: /api/linkedin/search/users
    params: { keywords: "data engineer", count: 50 }

  # === TYPE 2: from_file source (batch from file) ===
  - id: companies
    endpoint: /api/linkedin/company
    from_file: companies.txt        # Input file: .txt (line per value), .csv, .jsonl
    file_field: company_slug        # CSV column name (for CSV files only)
    input_key: company              # API parameter to fill with each value
    parallel: 3

  # === TYPE 3: Dependent source (values from parent) ===
  - id: employees
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies        # Parent source ID (required)
      field: urn.value              # Dot-notation path to extract from parent records
      match_by: name                # Alternative: fuzzy match instead of exact field
      dedupe: true                  # Remove duplicate values (default: false)
    input_key: companies            # API parameter for extracted values
    input_template:                 # Transform values before API call
      companies:
        - type: company
          value: "{value}"          # {value} = extracted value placeholder
      count: 5
    refresh: auto                   # Incremental behavior: auto (default) | always

  # === TYPE 4: Union source (combine multiple sources) ===
  - id: all_search_results
    type: union                     # Source type: api (default) | union | llm
    sources: [search_results, search_extra]  # Parent source IDs to combine (required)
    dedupe_by: urn.value            # Optional: field path for deduplication (dot-notation)
    # NOTE: type: union cannot have endpoint, dependency, from_file, input_key, params
    # NOTE: all sources in the list must have the same endpoint (same data structure)
    # Records are annotated with _union_source = parent source ID

  # === TYPE 5: LLM source (process parent data without API) ===
  - id: employees_analyzed
    type: llm                       # Source type: api (default) | union | llm
    dependency:
      from_source: employees
      field: name                   # Required by schema (not used for LLM sources)
    llm:                            # LLM enrichment steps (required for type: llm)
      - type: classify              # Step types: classify | enrich | summarize | generate
        categories: "developer,recruiter,executive"  # Comma-separated (omit for auto-detect)
        output_column: role_type    # Output column name (default: category)
        fields: [headline]          # Record fields to include in LLM prompt
      - type: enrich
        add:                        # Field specs (required for enrich)
          - "sentiment:positive/negative/neutral"   # Enum: value1/value2/value3
          - "language:string"                       # Types: string | number | integer | boolean
          - "quality_score:1-10"                    # Range: min-max
        fields: [headline, summary]
        temperature: 0.0            # LLM temperature: 0.0-1.0 (default: 0.0)
        provider: openai            # Provider override: openai | anthropic
        model: gpt-4o-mini          # Model override
      - type: summarize
        max_length: 50              # Max words (default: 100)
        output_column: bio
      - type: generate
        prompt: "Write pitch for {name}"  # Template with {field} placeholders (required)
        output_column: pitch
        temperature: 0.7            # Higher for creative text
    export:                         # Export destinations (runs after Parquet write)
      - type: file
        path: ./output/{{source}}-{{date}}.csv  # Templates: {{date}}, {{source}}, {{dataset}}
        format: csv                 # Format: json | jsonl | csv
    # NOTE: type: llm cannot have endpoint, from_file, input_key, params

  # === OPTIONAL BLOCKS (any source type) ===
  - id: profiles
    endpoint: /api/linkedin/user
    dependency: { from_source: employees, field: internal_id.value }
    input_key: user
    transform:                      # Transform for exports only (Parquet keeps all)
      filter: '.follower_count > 100 and .location != ""'  # Safe expression
      fields:                       # Field selection with dot-notation aliases
        - name
        - urn.value AS urn_id
        - headline
      add_columns:                  # Static columns to inject
        batch: "q1-2026"
    export:
      - type: file
        path: ./output/profiles.csv
        format: csv
      - type: webhook
        url: https://example.com/hook
        headers: { X-Token: abc }
    db_load:                        # Database loading config
      table: people                 # Custom table name (default: source ID)
      key: urn.value                # Unique key for diff-based incremental sync
      sync: full                    # Sync mode: full (default) | append (no DELETE)
      fields:                       # Fields to load (default: all except _input_value)
        - name
        - urn.value AS urn_id
        - headline
      exclude: [_input_value, raw_html]  # Fields to skip

storage:
  format: parquet                   # Storage format (only parquet supported)
  path: ./data/                     # Storage directory (relative to dataset.yaml)

schedule:
  cron: "0 9 * * *"                 # Cron expression for scheduling

notifications:
  on_complete:
    - url: "https://hooks.slack.com/xxx"
      headers: { Authorization: "Bearer token" }
  on_failure:
    - url: "https://alerts.example.com/fail"

type: union Source Details

The type: union source combines records from multiple parent sources:

•Requires: non-empty sources list (parent source IDs)
•Optional: dedupe_by field path for removing duplicates (supports dot-notation)
•Cannot have: endpoint, dependency, from_file, input_key, params
•Validation: all parent sources must have the same endpoint (same data structure)
•Use case: merge multiple search results before a single dependent source processes them

yaml

sources:
  - id: search_cto
    endpoint: /api/linkedin/search/users
    params: { keywords: "CTO fintech", count: 50 }

  - id: search_vp
    endpoint: /api/linkedin/search/users
    params: { keywords: "VP Engineering", count: 50 }

  # Union combines all search results
  - id: all_candidates
    type: union
    sources: [search_cto, search_vp]
    dedupe_by: urn.value              # Remove duplicates by URN

  # Single dependent source processes all candidates
  - id: profiles
    endpoint: /api/linkedin/user
    dependency:
      from_source: all_candidates
      field: urn.value
    input_key: user

Records from union sources are annotated with _union_source (the parent source ID they came from).

type: llm Source Details

The type: llm source processes existing parent data without making API calls:

•Requires: dependency (parent source) and non-empty llm list
•Cannot have: endpoint, from_file, input_key, params
•Use case: Run LLM enrichment on already-collected data, re-analyze with different prompts

bash

# Collect only the LLM source (reads parent Parquet, applies LLM steps)
anysite dataset collect dataset.yaml --source employees_analyzed

Collect Commands

bash

anysite dataset collect dataset.yaml --dry-run          # Preview plan
anysite dataset collect dataset.yaml                     # Run collection
anysite dataset collect dataset.yaml --load-db pg       # Collect + auto-load to DB
anysite dataset collect dataset.yaml --incremental      # Skip already-collected inputs
anysite dataset collect dataset.yaml --source employees # Single source + dependencies
anysite dataset collect dataset.yaml --no-llm           # Skip LLM enrichment steps

Incremental Collection (Resume Where You Left Off)

For from_file and dependency sources, anysite tracks collected input values in metadata.json. This enables resuming collection without re-fetching existing data.

Workflow:

bash

# First run — collects all inputs
anysite dataset collect dataset.yaml

# Later: add new items to input file, run with --incremental
anysite dataset collect dataset.yaml --incremental
# → Only new items are collected, existing ones skipped

# Force full re-collection
anysite dataset reset-cursor dataset.yaml
anysite dataset collect dataset.yaml

Per-source refresh option:

yaml

sources:
  - id: profiles
    refresh: auto      # (default) respects --incremental

  - id: posts
    refresh: always    # always re-collects, ignores --incremental
                       # use for time-sensitive data (feeds, activity)

Reset cursor:

bash

anysite dataset reset-cursor dataset.yaml                  # all sources
anysite dataset reset-cursor dataset.yaml --source posts   # specific source

Query with DuckDB

bash

anysite dataset query dataset.yaml --sql "SELECT * FROM companies LIMIT 10"
anysite dataset query dataset.yaml --source profiles --fields "name, urn.value AS id"
anysite dataset query dataset.yaml --interactive        # SQL shell
anysite dataset stats dataset.yaml --source companies
anysite dataset profile dataset.yaml

Load into Database

bash

anysite dataset load-db dataset.yaml -c pg --drop-existing  # Full load
anysite dataset load-db dataset.yaml -c pg                   # Incremental sync (when db_load.key set)
anysite dataset load-db dataset.yaml -c pg --snapshot 2026-01-15

Compare Snapshots

bash

anysite dataset diff dataset.yaml --source profiles --key urn.value
anysite dataset diff dataset.yaml --source profiles --key urn.value --from 2026-01-30 --to 2026-02-01

Scheduling & History

bash

anysite dataset schedule dataset.yaml --incremental --load-db pg  # Generate cron entry
anysite dataset history my-dataset
anysite dataset logs my-dataset --run 42
anysite dataset reset-cursor dataset.yaml                         # Clear incremental state

Database Operations

bash

# Connection management
anysite db add pg --type postgres --host localhost --database mydb --user app --password-env PGPASS
anysite db add local --type sqlite --path ./data.db
anysite db list
anysite db test pg

# Data operations
cat data.jsonl | anysite db insert pg --table users --stdin --auto-create
anysite db query pg --sql "SELECT * FROM users" --format table

# Pipe API output to database
anysite api /api/linkedin/user user=satyanadella -q --format jsonl \
  | anysite db insert pg --table profiles --stdin --auto-create

LLM Analysis Commands

Analyze collected dataset records using LLM. Requires anysite llm setup first.

bash

anysite llm summarize dataset.yaml --source profiles --fields "name,headline" --format table
anysite llm classify dataset.yaml --source posts --categories "positive,negative,neutral"
anysite llm enrich dataset.yaml --source profiles \
  --add "seniority:junior/mid/senior" --add "is_technical:boolean"
anysite llm generate dataset.yaml --source profiles \
  --prompt "Write intro for {name} who works as {headline}" --temperature 0.7
anysite llm match dataset.yaml --source-a profiles --source-b companies --top-k 3
anysite llm deduplicate dataset.yaml --source profiles --key name --threshold 0.8
anysite llm cache-stats
anysite llm cache-clear

Use anysite describe --search <keyword> for more endpoints.

Key Patterns

•Output formats: --format json|jsonl|csv|table
•Field selection: --fields "name,headline,urn.value" (dot-notation for nested)
•Error handling: --on-error stop|skip|retry
•Config priority: CLI args > ENV vars > ~/.anysite/config.yaml > defaults

References

For detailed option tables and advanced configuration:

•api-reference.md — all CLI options
•llm-reference.md — LLM provider config, cache, prompts
•dataset-guide.md — full YAML schema, advanced features