semantik-plugin-development

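Create Semantik plugins (connectors, embedding models, chunkers, rerankers, extractors, agents). Use when developing plugins, creating new integrations, or discussing plugin patterns, protocols, and testing methods.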

SKILL.md
---
name: semantik-plugin-development
description: Create semantik plugins (connectors, embeddings, chunkers, rerankers, extractors, agents). Use when developing plugins, creating new integrations, or asking about plugin patterns, protocols, or testing.
metadata:
  short-description: Develop semantik plugins
---

Semantik Plugin Development

This skill helps you create plugins for Semantik, a self-hosted semantic search engine. Plugins extend Semantik's capabilities for document ingestion, embedding, chunking, reranking, extraction, and AI agents.

Protocol Version

Current Version: 1.0.0

Breaking changes to protocols increment the major version. Your plugins continue to work as long as they satisfy the protocol interface.

Security Note

Plugins run in-process with the main Semantik application (no sandboxing). Only install plugins you trust. See Security Guide for details.

Quick Start

Create a minimal connector plugin in 5 minutes:

python
# my_connector.py
from typing import ClassVar, Any, AsyncIterator
import hashlib

class MyConnector:
    PLUGIN_ID: ClassVar[str] = "my-connector"
    PLUGIN_TYPE: ClassVar[str] = "connector"
    PLUGIN_VERSION: ClassVar[str] = "1.0.0"

    def __init__(self, config: dict[str, Any]) -> None:
        self._config = config

    async def authenticate(self) -> bool:
        return True

    async def load_documents(self, source_id: int | None = None) -> AsyncIterator[dict[str, Any]]:
        content = "Document content..."
        yield {
            "content": content,
            "unique_id": "doc-1",
            "source_type": self.PLUGIN_ID,
            "metadata": {},
            "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        }

    @classmethod
    def get_config_fields(cls) -> list[dict[str, Any]]:
        return []

    @classmethod
    def get_secret_fields(cls) -> list[dict[str, Any]]:
        return []

    @classmethod
    def get_manifest(cls) -> dict[str, Any]:
        return {"id": cls.PLUGIN_ID, "type": cls.PLUGIN_TYPE, "version": cls.PLUGIN_VERSION,
                "display_name": "My Connector", "description": "Custom connector"}

Plugin Types

| Type | Purpose | Key Method | Template |
| --- | --- | --- | --- |
| connector | Ingest documents from sources | load_documents() | connector.py |
| embedding | Convert text to vectors | embed_texts() | embedding.py |
| chunking | Split documents into chunks | chunk() | chunking.py |
| reranker | Reorder search results | rerank() | reranker.py |
| extractor | Extract entities/metadata | extract() | extractor.py |
| agent | LLM-powered capabilities | execute() | agent.py |

Development Approach

Protocol-Based (Recommended)

Use plain Python classes with no semantik imports. Plugins are validated by structural typing (duck typing):

python
from typing import ClassVar

class MyPlugin:
    PLUGIN_ID: ClassVar[str] = "my-plugin"
    PLUGIN_TYPE: ClassVar[str] = "connector"  # or embedding, chunking, etc.
    PLUGIN_VERSION: ClassVar[str] = "1.0.0"
    # ... implement required methods

Benefits:

  • Zero dependencies on semantik
  • Develop in separate repository
  • Distribute via PyPI or git
  • No version conflicts

ABC-Based (Advanced)

Inherit from semantik base classes when you need access to internal utilities:

python
from shared.connectors.base import BaseConnector

class MyConnector(BaseConnector):
    # ... inherit helper methods
    ...  # placeholder body so the class definition is valid Python

Use when:

  • Building embedding plugins with GPU management
  • Needing access to shared utilities
  • Developing internal/builtin plugins

Required Class Variables

Every plugin must define:

python
from typing import ClassVar, Any

class MyPlugin:
    PLUGIN_ID: ClassVar[str] = "my-plugin"      # Unique ID (lowercase, hyphens)
    PLUGIN_TYPE: ClassVar[str] = "connector"    # One of 6 types
    PLUGIN_VERSION: ClassVar[str] = "1.0.0"     # Semantic version

Some plugin types require additional class variables:

| Type | Additional Variables |
| --- | --- |
| connector | METADATA (dict with name, description, icon) |
| embedding | INTERNAL_NAME, API_ID, PROVIDER_TYPE, METADATA |
| chunking | (none) |
| reranker | (none) |
| extractor | (none) |
| agent | (none) |
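The three required class variables can be checked before a plugin is packaged. The helper below is a hypothetical sketch: the two regular expressions are this example's reading of "lowercase, hyphens" and "semantic version", and Semantik's own rules may be stricter or looser.

```python
import re

VALID_TYPES = {"connector", "embedding", "chunking", "reranker", "extractor", "agent"}
ID_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")   # assumed ID format
VERSION_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")          # MAJOR.MINOR.PATCH only


def class_variable_problems(plugin_cls: type) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: list[str] = []
    values: dict[str, str] = {}
    for name in ("PLUGIN_ID", "PLUGIN_TYPE", "PLUGIN_VERSION"):
        value = getattr(plugin_cls, name, None)
        if isinstance(value, str):
            values[name] = value
        else:
            problems.append(f"{name} is missing or not a string")
    if "PLUGIN_ID" in values and not ID_PATTERN.match(values["PLUGIN_ID"]):
        problems.append("PLUGIN_ID should use lowercase letters, digits, and hyphens")
    if "PLUGIN_TYPE" in values and values["PLUGIN_TYPE"] not in VALID_TYPES:
        problems.append(f"PLUGIN_TYPE must be one of {sorted(VALID_TYPES)}")
    if "PLUGIN_VERSION" in values and not VERSION_PATTERN.match(values["PLUGIN_VERSION"]):
        problems.append("PLUGIN_VERSION should look like 1.0.0")
    return problems


class GoodPlugin:
    PLUGIN_ID = "my-plugin"
    PLUGIN_TYPE = "connector"
    PLUGIN_VERSION = "1.0.0"


class BadPlugin:
    PLUGIN_ID = "My_Plugin"
    PLUGIN_TYPE = "loader"
    # PLUGIN_VERSION intentionally omitted


print(class_variable_problems(GoodPlugin))
print(class_variable_problems(BadPlugin))
```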

Manifest Method

All plugins must implement get_manifest():

python
@classmethod
def get_manifest(cls) -> dict[str, Any]:
    return {
        "id": cls.PLUGIN_ID,
        "type": cls.PLUGIN_TYPE,
        "version": cls.PLUGIN_VERSION,
        "display_name": "My Plugin",
        "description": "What the plugin does",
        # Optional fields:
        "author": "Your Name",
        "license": "MIT",
        "homepage": "https://github.com/...",
        "requires": ["other-plugin"],  # Dependencies
        "capabilities": {},  # Plugin-specific capabilities
    }
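A manifest can be sanity-checked with a few lines of plain Python. This sketch assumes that the five keys listed before the "Optional fields" comment are the required ones; `missing_manifest_keys` is a hypothetical helper, not part of Semantik.

```python
from typing import Any

# Assumption: the keys shown above the "Optional fields" comment are required.
REQUIRED_MANIFEST_KEYS = ("id", "type", "version", "display_name", "description")


def missing_manifest_keys(manifest: dict[str, Any]) -> list[str]:
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_MANIFEST_KEYS if not manifest.get(key)]


complete = {
    "id": "my-plugin",
    "type": "connector",
    "version": "1.0.0",
    "display_name": "My Plugin",
    "description": "What the plugin does",
}
partial = {"id": "my-plugin", "type": "connector", "display_name": ""}

print(missing_manifest_keys(complete))
print(missing_manifest_keys(partial))
```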

Configuration

Config Fields (UI)

Define configuration fields for the Semantik UI:

python
@classmethod
def get_config_fields(cls) -> list[dict[str, Any]]:
    return [
        {
            "name": "base_url",
            "type": "text",        # text, password, number, boolean, select
            "label": "Base URL",
            "description": "API endpoint",
            "required": True,
            "placeholder": "https://api.example.com",
        },
        {
            "name": "model",
            "type": "select",
            "label": "Model",
            "options": ["model-a", "model-b"],
            "default": "model-a",
        },
    ]
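Field definitions like these can also drive validation inside the plugin. The sketch below shows one possible way to apply defaults and reject bad values; `apply_config_fields` is a hypothetical helper, and whether Semantik performs the same checks before calling the plugin is not stated here.

```python
from typing import Any

FIELDS = [
    {"name": "base_url", "type": "text", "label": "Base URL", "required": True},
    {"name": "model", "type": "select", "label": "Model",
     "options": ["model-a", "model-b"], "default": "model-a"},
]


def apply_config_fields(fields: list[dict[str, Any]], user_config: dict[str, Any]) -> dict[str, Any]:
    """Fill in defaults, enforce required fields, and check select options."""
    resolved: dict[str, Any] = {}
    for field in fields:
        name = field["name"]
        if name in user_config:
            value = user_config[name]
        elif "default" in field:
            value = field["default"]
        elif field.get("required"):
            raise ValueError(f"missing required field: {name}")
        else:
            continue  # optional field without a default stays unset
        if field["type"] == "select" and value not in field.get("options", []):
            raise ValueError(f"invalid option for {name}: {value!r}")
        resolved[name] = value
    return resolved


resolved = apply_config_fields(FIELDS, {"base_url": "https://api.example.com"})
print(resolved)
```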

Secret Fields

Mark fields that contain secrets (encrypted at rest):

python
@classmethod
def get_secret_fields(cls) -> list[dict[str, Any]]:
    return [
        {"name": "api_key", "label": "API Key", "required": True},
    ]
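The same secret field list can be reused to keep secret values out of logs. This is a sketch of one approach; `redact_secrets` is a hypothetical helper and not something Semantik is known to provide.

```python
from typing import Any


def redact_secrets(config: dict[str, Any], secret_fields: list[dict[str, Any]]) -> dict[str, Any]:
    """Return a copy of config with every declared secret value masked."""
    secret_names = {field["name"] for field in secret_fields}
    return {key: "***" if key in secret_names else value for key, value in config.items()}


secret_fields = [{"name": "api_key", "label": "API Key", "required": True}]
config = {"api_key": "sk-demo-value", "base_url": "https://api.example.com"}

print(redact_secrets(config, secret_fields))
```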

Environment Variables

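Use the _env suffix pattern for secrets. The user enters the name of an environment variable, and Semantik resolves it to the actual value at runtime:

python
# In the user's config: the value is the NAME of an environment variable
user_config = {"api_key_env": "OPENAI_API_KEY"}

# At runtime, Semantik resolves it to the variable's value
config = {"api_key": "sk-actual-key-value"}  # Resolved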
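For local testing it can help to mimic that resolution step. The function below is a guess at the behaviour described above, written for illustration only: `resolve_env_fields` is a hypothetical name, and how Semantik handles a missing variable is an assumption.

```python
import os
from typing import Any, Mapping


def resolve_env_fields(config: Mapping[str, Any],
                       environ: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Replace each '<name>_env' key with '<name>' holding the variable's value."""
    resolved: dict[str, Any] = {}
    for key, value in config.items():
        if key.endswith("_env"):
            if value not in environ:
                # Assumption: an unset variable is treated as an error.
                raise KeyError(f"environment variable {value!r} is not set")
            resolved[key[: -len("_env")]] = environ[value]
        else:
            resolved[key] = value
    return resolved


fake_environ = {"DEMO_API_KEY": "sk-demo"}
print(resolve_env_fields({"api_key_env": "DEMO_API_KEY", "model": "model-a"}, fake_environ))
```

Passing a plain dict as `environ` keeps tests independent of the real process environment.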

Testing

Manual Verification

bash
pip install -e .
python -c "
from my_plugin.connector import MyConnector
print(f'ID: {MyConnector.PLUGIN_ID}')
print(f'Type: {MyConnector.PLUGIN_TYPE}')
print(f'Manifest: {MyConnector.get_manifest()}')
"

Protocol Validation

python
import pytest

from my_plugin.connector import MyConnector as MyPlugin  # adjust to your package layout

class TestMyPlugin:
    def test_has_required_attributes(self):
        assert hasattr(MyPlugin, "PLUGIN_ID")
        assert hasattr(MyPlugin, "PLUGIN_TYPE")
        assert hasattr(MyPlugin, "PLUGIN_VERSION")
        assert MyPlugin.PLUGIN_TYPE == "connector"

    def test_manifest_format(self):
        manifest = MyPlugin.get_manifest()
        assert "id" in manifest
        assert "type" in manifest
        assert "display_name" in manifest

    @pytest.mark.asyncio  # provided by the pytest-asyncio plugin
    async def test_core_functionality(self):
        plugin = MyPlugin(config={})
        # Test plugin-specific methods

With Semantik Test Mixins

If semantik is installed:

python
from shared.plugins.testing.contracts import ConnectorProtocolTestMixin

from my_plugin.connector import MyConnector  # adjust to your package layout

class TestMyConnector(ConnectorProtocolTestMixin):
    plugin_class = MyConnector

Packaging

pyproject.toml

toml
[project]
name = "semantik-plugin-myconnector"
version = "1.0.0"
requires-python = ">=3.10"
dependencies = []  # Your dependencies only

[project.entry-points."semantik.plugins"]
my-connector = "my_plugin.connector:MyConnector"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

See templates/pyproject.toml for a complete template.

Entry Point Format

code
plugin-id = "module.path:ClassName"
  • plugin-id: Should match PLUGIN_ID
  • module.path: Python import path
  • ClassName: Your plugin class
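The `module.path:ClassName` form can be resolved by hand with `importlib`, which is useful for checking that an entry point string is correct. A minimal sketch, using a standard-library class as the target so it runs anywhere; `load_target` is a hypothetical helper.

```python
from importlib import import_module
from typing import Any


def load_target(value: str) -> Any:
    """Import 'module.path' and return the attribute named after the colon."""
    module_path, sep, attr = value.partition(":")
    if not sep or not module_path or not attr:
        raise ValueError(f"expected 'module.path:ClassName', got {value!r}")
    return getattr(import_module(module_path), attr)


# Stand-in target; for a real plugin this would be "my_plugin.connector:MyConnector".
cls = load_target("collections:OrderedDict")
print(cls.__name__)
```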

Installation

bash
# Development
pip install -e .

# From git
pip install git+https://github.com/you/semantik-plugin-myconnector.git

# Via the Semantik API (an HTTP request, not a shell command):
#   POST /api/v2/plugins/install
#   {"install_command": "git+https://github.com/..."}

Common Issues

Plugin Not Loading

  1. Check entry point is registered:

    bash
    pip show --verbose semantik-plugin-myconnector  # --verbose includes the Entry-points section
    
  2. Verify PLUGIN_TYPE is valid:

    python
    assert MyConnector.PLUGIN_TYPE in ["connector", "embedding", "chunking", "reranker", "extractor", "agent"]
    
  3. Check for import errors:

    python
    try:
        from my_plugin.connector import MyConnector
    except ImportError as e:
        print(f"Error: {e}")
    

Validation Errors

| Error | Fix |
| --- | --- |
| missing required keys: {'content'} | Add all required fields to returned dict |
| Invalid role: 'xyz' | Use valid string from MESSAGE_ROLES |
| content_hash must be 64 characters | Use hashlib.sha256(text.encode()).hexdigest() |

Async Issues

All I/O methods must be async:

python
from typing import AsyncIterator

# Wrong: a synchronous generator
def load_documents(self):
    yield {"content": "..."}

# Right: an async generator
async def load_documents(self) -> AsyncIterator[dict]:
    yield {"content": "..."}
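When the underlying I/O is blocking (for example reading local files), one common approach is to push it onto a worker thread with `asyncio.to_thread` so the async generator does not stall the event loop. This is a sketch under that assumption; `load_text_files` is a hypothetical function, and the yielded dicts are trimmed to two keys for brevity.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Any, AsyncIterator


async def load_text_files(directory: Path) -> AsyncIterator[dict[str, Any]]:
    """Yield one partial document per .txt file, reading off the event loop."""
    for path in sorted(directory.glob("*.txt")):
        # Path.read_text blocks, so run it in a worker thread.
        content = await asyncio.to_thread(path.read_text, encoding="utf-8")
        yield {"content": content, "unique_id": path.name}


async def main() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text("alpha", encoding="utf-8")
        (root / "b.txt").write_text("beta", encoding="utf-8")
        return [doc["content"] async for doc in load_text_files(root)]


contents = asyncio.run(main())
print(contents)
```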

Templates

Ready-to-use templates in templates/:

| File | Description |
| --- | --- |
| connector.py | Document source connector |
| embedding.py | Embedding model provider |
| chunking.py | Text chunking strategy |
| reranker.py | Search result reranker |
| extractor.py | Entity/metadata extractor |
| agent.py | LLM-powered agent |
| pyproject.toml | Package configuration |

Copy a template and modify:

bash
cp templates/connector.py my_connector.py
# Edit PLUGIN_ID, PLUGIN_VERSION, and implement methods

Data Format Reference

Connector Documents (IngestedDocumentDict)

python
{
    "content": str,              # Full text (required)
    "unique_id": str,            # Unique identifier (required)
    "source_type": str,          # Your PLUGIN_ID (required)
    "metadata": dict,            # Source metadata (required)
    "content_hash": str,         # SHA-256, 64 hex chars (required)
    "file_path": str | None,     # Local path (optional)
}
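A connector's output can be checked against this shape before it reaches Semantik. The helper below is a hypothetical sketch based only on the required keys and hash length listed above; the real validator may check more.

```python
import hashlib
import re
from typing import Any

REQUIRED_KEYS = {"content", "unique_id", "source_type", "metadata", "content_hash"}
SHA256_HEX = re.compile(r"^[0-9a-fA-F]{64}$")


def document_problems(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the document looks valid."""
    problems: list[str] = []
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")
    if "content_hash" in doc:
        value = doc["content_hash"]
        if not isinstance(value, str) or not SHA256_HEX.match(value):
            problems.append("content_hash must be 64 hex characters")
    return problems


text = "Document content..."
good = {
    "content": text,
    "unique_id": "doc-1",
    "source_type": "my-connector",
    "metadata": {},
    "content_hash": hashlib.sha256(text.encode()).hexdigest(),
}
bad = {"unique_id": "doc-2", "source_type": "my-connector", "metadata": {}, "content_hash": "abc"}

print(document_problems(good))
print(document_problems(bad))
```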

Chunk Format (ChunkDict)

python
{
    "content": str,              # Chunk text (required)
    "metadata": {                # Chunk metadata (required)
        "chunk_index": int,
        "start_offset": int,
        "end_offset": int,
    },
    "chunk_id": str | None,      # Unique ID (optional)
    "embedding": list[float] | None,  # Pre-computed (optional)
}
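As an illustration of the chunk shape, here is a fixed-size character chunker. It is a sketch only: `make_chunks` is a hypothetical function rather than the chunk() method's real signature, and treating end_offset as exclusive is an assumption.

```python
from typing import Any


def make_chunks(text: str, size: int) -> list[dict[str, Any]]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    chunks: list[dict[str, Any]] = []
    for index, start in enumerate(range(0, len(text), size)):
        end = min(start + size, len(text))  # exclusive end offset (assumed)
        chunks.append({
            "content": text[start:end],
            "metadata": {"chunk_index": index, "start_offset": start, "end_offset": end},
        })
    return chunks


chunks = make_chunks("abcdefghij", 4)
for chunk in chunks:
    print(chunk["metadata"], repr(chunk["content"]))
```

With exclusive end offsets, `text[start_offset:end_offset]` reproduces each chunk exactly, which makes the offsets easy to verify in tests.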

Rerank Result (RerankResultDict)

python
{
    "index": int,                # Original document index (required)
    "score": float,              # Relevance score (required)
    "text": str | None,          # Document text (optional)
    "metadata": dict | None,     # Metadata (optional)
}

Agent Message (AgentMessageDict)

python
{
    "id": str,                   # Unique ID (required)
    "role": str,                 # user, assistant, system, tool_call, tool_result, error
    "type": str,                 # text, thinking, tool_use, tool_output, partial, final, error
    "content": str,              # Message content (required)
    "timestamp": str,            # ISO 8601 (required)
    "is_partial": bool,          # Streaming partial (optional)
    "sequence_number": int,      # Message order (optional)
}

Getting Help

  • Semantik docs: See semantik/docs/external-plugins.md for protocol details
  • Protocol reference: See semantik/docs/plugin-protocols.md for full specifications
  • Examples: Check semantik/packages/shared/plugins/builtins/ for built-in plugins