Semantik Plugin Development
This skill helps you create plugins for Semantik, a self-hosted semantic search engine. Plugins extend Semantik's capabilities for document ingestion, embedding, chunking, reranking, extraction, and AI agents.
Protocol Version
Current Version: 1.0.0
Breaking changes to protocols increment the major version. Your plugins continue to work as long as they satisfy the protocol interface.
Security Note
Plugins run in-process with the main Semantik application (no sandboxing). Only install plugins you trust. See Security Guide for details.
Quick Start
Create a minimal connector plugin in 5 minutes:
# my_connector.py
from typing import ClassVar, Any, AsyncIterator
import hashlib

class MyConnector:
    PLUGIN_ID: ClassVar[str] = "my-connector"
    PLUGIN_TYPE: ClassVar[str] = "connector"
    PLUGIN_VERSION: ClassVar[str] = "1.0.0"

    def __init__(self, config: dict[str, Any]) -> None:
        self._config = config

    async def authenticate(self) -> bool:
        return True

    async def load_documents(self, source_id: int | None = None) -> AsyncIterator[dict[str, Any]]:
        content = "Document content..."
        yield {
            "content": content,
            "unique_id": "doc-1",
            "source_type": self.PLUGIN_ID,
            "metadata": {},
            "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        }

    @classmethod
    def get_config_fields(cls) -> list[dict[str, Any]]:
        return []

    @classmethod
    def get_secret_fields(cls) -> list[dict[str, Any]]:
        return []

    @classmethod
    def get_manifest(cls) -> dict[str, Any]:
        return {"id": cls.PLUGIN_ID, "type": cls.PLUGIN_TYPE, "version": cls.PLUGIN_VERSION,
                "display_name": "My Connector", "description": "Custom connector"}
Plugin Types
| Type | Purpose | Key Method | Template |
|---|---|---|---|
| connector | Ingest documents from sources | load_documents() | connector.py |
| embedding | Convert text to vectors | embed_texts() | embedding.py |
| chunking | Split documents into chunks | chunk() | chunking.py |
| reranker | Reorder search results | rerank() | reranker.py |
| extractor | Extract entities/metadata | extract() | extractor.py |
| agent | LLM-powered capabilities | execute() | agent.py |
Type-specific guides:
- Connector Guide - Document sources, async iterators
- Embedding Guide - Query/document modes, dimensions
- Chunking Guide - Text segmentation strategies
- Reranker Guide - Cross-encoder reranking
- Extractor Guide - Entity and metadata extraction
- Agent Guide - LLM agents, streaming, context
Cross-cutting guides:
- Testing Guide - Contract tests, mocks, fixtures
- Security Guide - Trust model, best practices
- Advanced Guide - Health checks, dependencies, migration
Development Approach
Protocol-Based (Recommended)
Use plain Python classes with no semantik imports. Plugins are validated by structural typing (duck typing):
class MyPlugin:
    PLUGIN_ID: ClassVar[str] = "my-plugin"
    PLUGIN_TYPE: ClassVar[str] = "connector"  # or embedding, chunking, etc.
    PLUGIN_VERSION: ClassVar[str] = "1.0.0"
    # ... implement required methods
Benefits:
- Zero dependencies on semantik
- Develop in separate repository
- Distribute via PyPI or git
- No version conflicts
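The structural check can be pictured as a small duck-typing test on the class. The sketch below is illustrative only: `missing_members`, `REQUIRED_ATTRS`, and `REQUIRED_METHODS` are hypothetical names, and Semantik's real validator may check more (or different) members.

```python
from typing import Any, ClassVar

# Assumption: these are the members every plugin shares, per this guide.
REQUIRED_ATTRS = ("PLUGIN_ID", "PLUGIN_TYPE", "PLUGIN_VERSION")
REQUIRED_METHODS = ("get_manifest",)


def missing_members(plugin_cls: type) -> list[str]:
    """Return the names the class lacks; an empty list means the shape matches."""
    missing = [
        name for name in REQUIRED_ATTRS
        if not isinstance(getattr(plugin_cls, name, None), str)
    ]
    missing += [
        name for name in REQUIRED_METHODS
        if not callable(getattr(plugin_cls, name, None))
    ]
    return missing


class GoodPlugin:
    PLUGIN_ID: ClassVar[str] = "my-plugin"
    PLUGIN_TYPE: ClassVar[str] = "connector"
    PLUGIN_VERSION: ClassVar[str] = "1.0.0"

    @classmethod
    def get_manifest(cls) -> dict[str, Any]:
        return {"id": cls.PLUGIN_ID, "type": cls.PLUGIN_TYPE, "version": cls.PLUGIN_VERSION}


class BadPlugin:
    PLUGIN_ID: ClassVar[str] = "broken"


print(missing_members(GoodPlugin))  # []
print(missing_members(BadPlugin))   # ['PLUGIN_TYPE', 'PLUGIN_VERSION', 'get_manifest']
```

Neither class imports anything from semantik, which is the point of the protocol-based approach: the shape of the class is what gets checked, not its ancestry.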
ABC-Based (Advanced)
Inherit from semantik base classes when you need access to internal utilities:
from shared.connectors.base import BaseConnector

class MyConnector(BaseConnector):
    # ... inherit helper methods
    ...
Use when:
- Building embedding plugins with GPU management
- Need access to shared utilities
- Developing internal/builtin plugins
Required Class Variables
Every plugin must define:
from typing import ClassVar, Any

class MyPlugin:
    PLUGIN_ID: ClassVar[str] = "my-plugin"    # Unique ID (lowercase, hyphens)
    PLUGIN_TYPE: ClassVar[str] = "connector"  # One of 6 types
    PLUGIN_VERSION: ClassVar[str] = "1.0.0"   # Semantic version
Some plugin types require additional class variables:
| Type | Additional Variables |
|---|---|
| connector | METADATA (dict with name, description, icon) |
| embedding | INTERNAL_NAME, API_ID, PROVIDER_TYPE, METADATA |
| chunking | (none) |
| reranker | (none) |
| extractor | (none) |
| agent | (none) |
Manifest Method
All plugins must implement get_manifest():
@classmethod
def get_manifest(cls) -> dict[str, Any]:
    return {
        "id": cls.PLUGIN_ID,
        "type": cls.PLUGIN_TYPE,
        "version": cls.PLUGIN_VERSION,
        "display_name": "My Plugin",
        "description": "What the plugin does",
        # Optional fields:
        "author": "Your Name",
        "license": "MIT",
        "homepage": "https://github.com/...",
        "requires": ["other-plugin"],  # Dependencies
        "capabilities": {},  # Plugin-specific capabilities
    }
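A quick self-check can catch a manifest that drifts from the class variables. This is a minimal sketch, assuming the five keys listed before the "Optional fields" comment are the required ones; `manifest_problems` and `ExamplePlugin` are hypothetical and not part of Semantik.

```python
from typing import Any

# Assumption: keys shown before "# Optional fields" are treated as required.
REQUIRED_KEYS = ("id", "type", "version", "display_name", "description")
# Manifest keys that are expected to mirror a class variable.
MIRRORED = {"id": "PLUGIN_ID", "type": "PLUGIN_TYPE", "version": "PLUGIN_VERSION"}


def manifest_problems(plugin_cls: type) -> list[str]:
    """List missing keys and values that disagree with the class variables."""
    manifest = plugin_cls.get_manifest()
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]
    for key, attr in MIRRORED.items():
        if key in manifest and manifest[key] != getattr(plugin_cls, attr, None):
            problems.append(f"{key} does not match {attr}")
    return problems


class ExamplePlugin:
    PLUGIN_ID = "my-plugin"
    PLUGIN_TYPE = "connector"
    PLUGIN_VERSION = "1.0.0"

    @classmethod
    def get_manifest(cls) -> dict[str, Any]:
        # Deliberately inconsistent: stale version, no description.
        return {
            "id": cls.PLUGIN_ID,
            "type": cls.PLUGIN_TYPE,
            "version": "0.9.0",
            "display_name": "My Plugin",
        }


print(manifest_problems(ExamplePlugin))
# ['missing key: description', 'version does not match PLUGIN_VERSION']
```

Building the manifest from `cls.PLUGIN_ID` and friends, as the example above this sketch does, avoids the mismatch entirely.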
Configuration
Config Fields (UI)
Define configuration fields for the Semantik UI:
@classmethod
def get_config_fields(cls) -> list[dict[str, Any]]:
    return [
        {
            "name": "base_url",
            "type": "text",  # text, password, number, boolean, select
            "label": "Base URL",
            "description": "API endpoint",
            "required": True,
            "placeholder": "https://api.example.com",
        },
        {
            "name": "model",
            "type": "select",
            "label": "Model",
            "options": ["model-a", "model-b"],
            "default": "model-a",
        },
    ]
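Inside the plugin it can help to validate the incoming config against the same field definitions. The sketch below is one possible approach, not Semantik's behavior: `apply_config_fields` is a hypothetical helper, and whether Semantik itself applies defaults or enforces `required` before calling the plugin is not stated here.

```python
from typing import Any

FIELDS: list[dict[str, Any]] = [
    {"name": "base_url", "type": "text", "label": "Base URL", "required": True},
    {
        "name": "model",
        "type": "select",
        "label": "Model",
        "options": ["model-a", "model-b"],
        "default": "model-a",
    },
]


def apply_config_fields(fields: list[dict[str, Any]], user_config: dict[str, Any]) -> dict[str, Any]:
    """Fill defaults, reject missing required fields and invalid select options."""
    resolved: dict[str, Any] = {}
    for field in fields:
        name = field["name"]
        if name in user_config:
            value = user_config[name]
        elif "default" in field:
            value = field["default"]
        elif field.get("required", False):
            raise ValueError(f"missing required config field: {name}")
        else:
            continue  # optional field with no default: leave it out
        if field.get("type") == "select" and value not in field.get("options", []):
            raise ValueError(f"invalid option for {name}: {value!r}")
        resolved[name] = value
    return resolved


print(apply_config_fields(FIELDS, {"base_url": "https://api.example.com"}))
# {'base_url': 'https://api.example.com', 'model': 'model-a'}
```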
Secret Fields
Mark fields that contain secrets (encrypted at rest):
@classmethod
def get_secret_fields(cls) -> list[dict[str, Any]]:
    return [
        {"name": "api_key", "label": "API Key", "required": True},
    ]
Environment Variables
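Use the _env suffix pattern for secrets. The user enters the name of an environment variable in a field ending in _env, and Semantik resolves it to the actual value at runtime: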
# In config schema - user enters env var name
"api_key_env": "OPENAI_API_KEY"
# At runtime, semantik resolves it
config = {"api_key": "sk-actual-key-value"} # Resolved
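For local testing without Semantik, the same resolution can be imitated by hand. This is a rough sketch under assumptions: `resolve_env_fields` is a hypothetical helper, and the exact rules Semantik applies (for example, what happens when the variable is unset) are not documented here.

```python
import os
from typing import Any, Mapping, Optional


def resolve_env_fields(
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> dict[str, Any]:
    """Replace each '<name>_env' entry with '<name>' read from the environment."""
    env = os.environ if environ is None else environ
    resolved: dict[str, Any] = {}
    for key, value in config.items():
        if key.endswith("_env"):
            if value not in env:
                # Assumption: an unset variable is treated as an error.
                raise KeyError(f"environment variable {value!r} is not set")
            resolved[key[: -len("_env")]] = env[value]
        else:
            resolved[key] = value
    return resolved


example = resolve_env_fields(
    {"api_key_env": "OPENAI_API_KEY", "model": "model-a"},
    environ={"OPENAI_API_KEY": "sk-actual-key-value"},
)
print(example)  # {'api_key': 'sk-actual-key-value', 'model': 'model-a'}
```

Passing `environ` explicitly keeps tests independent of the real process environment.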
Testing
Manual Verification
pip install -e .
python -c "
from my_plugin import MyConnector
print(f'ID: {MyConnector.PLUGIN_ID}')
print(f'Type: {MyConnector.PLUGIN_TYPE}')
print(f'Manifest: {MyConnector.get_manifest()}')
"
Protocol Validation
import pytest

from my_plugin import MyPlugin  # adjust to your module and class name


class TestMyPlugin:
    def test_has_required_attributes(self):
        assert hasattr(MyPlugin, "PLUGIN_ID")
        assert hasattr(MyPlugin, "PLUGIN_TYPE")
        assert hasattr(MyPlugin, "PLUGIN_VERSION")
        assert MyPlugin.PLUGIN_TYPE == "connector"

    def test_manifest_format(self):
        manifest = MyPlugin.get_manifest()
        assert "id" in manifest
        assert "type" in manifest
        assert "display_name" in manifest

    @pytest.mark.asyncio
    async def test_core_functionality(self):
        plugin = MyPlugin(config={})
        # Test plugin-specific methods
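For a connector, the "plugin-specific methods" placeholder usually means authenticating and draining `load_documents()`. The sketch below shows one way to do that with plain `asyncio.run`, so it works without pytest plugins; `DemoConnector` is a trimmed stand-in for the Quick Start class and `collect_documents` is a hypothetical helper.

```python
import asyncio
import hashlib
from typing import Any, AsyncIterator


class DemoConnector:
    """Trimmed stand-in for MyConnector, just enough to run the check."""

    PLUGIN_ID = "my-connector"

    async def authenticate(self) -> bool:
        return True

    async def load_documents(self) -> AsyncIterator[dict[str, Any]]:
        content = "Document content..."
        yield {
            "content": content,
            "unique_id": "doc-1",
            "source_type": self.PLUGIN_ID,
            "metadata": {},
            "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        }


async def collect_documents(connector: Any) -> list[dict[str, Any]]:
    """Authenticate, then gather every document the connector yields."""
    if not await connector.authenticate():
        raise RuntimeError("authentication failed")
    return [doc async for doc in connector.load_documents()]


docs = asyncio.run(collect_documents(DemoConnector()))
print(docs[0]["unique_id"], len(docs[0]["content_hash"]))  # doc-1 64
```

Inside a pytest-asyncio test, the body of `collect_documents` can be awaited directly instead of going through `asyncio.run`.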
With Semantik Test Mixins
If semantik is installed:
from shared.plugins.testing.contracts import ConnectorProtocolTestMixin

class TestMyConnector(ConnectorProtocolTestMixin):
    plugin_class = MyConnector
Packaging
pyproject.toml
[project]
name = "semantik-plugin-myconnector"
version = "1.0.0"
requires-python = ">=3.10"
dependencies = []  # Your dependencies only

[project.entry-points."semantik.plugins"]
my-connector = "my_plugin.connector:MyConnector"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
See templates/pyproject.toml for a complete template.
Entry Point Format
plugin-id = "module.path:ClassName"
- plugin-id: Should match PLUGIN_ID
- module.path: Python import path
- ClassName: Your plugin class
Installation
# Development
pip install -e .
# From git
pip install git+https://github.com/you/semantik-plugin-myconnector.git
# Via Semantik API
POST /api/v2/plugins/install
{"install_command": "git+https://github.com/..."}
Common Issues
Plugin Not Loading
- Check entry point is registered:
  pip show semantik-plugin-myconnector
- Verify PLUGIN_TYPE is valid:
  assert PLUGIN_TYPE in ["connector", "embedding", "chunking", "reranker", "extractor", "agent"]
- Check for import errors:
  try:
      from my_plugin import MyConnector
  except ImportError as e:
      print(f"Error: {e}")
Validation Errors
| Error | Fix |
|---|---|
| missing required keys: {'content'} | Add all required fields to returned dict |
| Invalid role: 'xyz' | Use valid string from MESSAGE_ROLES |
| content_hash must be 64 characters | Use hashlib.sha256(text.encode()).hexdigest() |
Async Issues
All I/O methods must be async:
# Wrong
def load_documents(self):
    yield {"content": "..."}

# Right
async def load_documents(self) -> AsyncIterator[dict]:
    yield {"content": "..."}
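When the underlying client is synchronous (a blocking SDK or file read), one common option is to push each blocking call onto a worker thread so the async generator does not stall the event loop. A minimal sketch, assuming a hypothetical `fetch_page_blocking` function in place of a real client:

```python
import asyncio
import time
from typing import Any, AsyncIterator


def fetch_page_blocking(page: int) -> list[str]:
    """Hypothetical blocking call; stands in for a sync SDK or file read."""
    time.sleep(0.01)
    return [f"page {page} item {i}" for i in range(2)]


async def load_documents(pages: int = 2) -> AsyncIterator[dict[str, Any]]:
    for page in range(pages):
        # Run the blocking call in a worker thread and await its result.
        items = await asyncio.to_thread(fetch_page_blocking, page)
        for text in items:
            yield {"content": text}


async def main() -> list[str]:
    return [doc["content"] async for doc in load_documents()]


contents = asyncio.run(main())
print(contents)
# ['page 0 item 0', 'page 0 item 1', 'page 1 item 0', 'page 1 item 1']
```

The yielded dicts here carry only `content` to keep the example short; a real connector would include the other required keys.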
Templates
Ready-to-use templates in templates/:
| File | Description |
|---|---|
| connector.py | Document source connector |
| embedding.py | Embedding model provider |
| chunking.py | Text chunking strategy |
| reranker.py | Search result reranker |
| extractor.py | Entity/metadata extractor |
| agent.py | LLM-powered agent |
| pyproject.toml | Package configuration |
Copy a template and modify:
cp templates/connector.py my_connector.py
# Edit PLUGIN_ID, PLUGIN_VERSION, and implement methods
Data Format Reference
Connector Documents (IngestedDocumentDict)
{
    "content": str,           # Full text (required)
    "unique_id": str,         # Unique identifier (required)
    "source_type": str,       # Your PLUGIN_ID (required)
    "metadata": dict,         # Source metadata (required)
    "content_hash": str,      # SHA-256, 64 hex chars (required)
    "file_path": str | None,  # Local path (optional)
}
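A connector can check its own output against this shape before yielding it. The sketch below is a hypothetical pre-flight check (`document_problems` is not a Semantik function); the messages imitate the validation errors listed earlier, but the real validator's wording and rules may differ.

```python
import hashlib
import re
from typing import Any

REQUIRED_KEYS = ("content", "unique_id", "source_type", "metadata", "content_hash")
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def document_problems(doc: dict[str, Any]) -> list[str]:
    """Report missing required keys and a malformed content_hash."""
    problems: list[str] = []
    missing = sorted(key for key in REQUIRED_KEYS if key not in doc)
    if missing:
        problems.append(f"missing required keys: {missing}")
    digest = doc.get("content_hash")
    if digest is not None and not (isinstance(digest, str) and SHA256_HEX.fullmatch(digest)):
        problems.append("content_hash must be 64 characters")
    return problems


text = "Document content..."
good = {
    "content": text,
    "unique_id": "doc-1",
    "source_type": "my-connector",
    "metadata": {},
    "content_hash": hashlib.sha256(text.encode()).hexdigest(),
}
bad = {"unique_id": "doc-2", "source_type": "my-connector", "metadata": {}, "content_hash": "abc"}

print(document_problems(good))  # []
print(document_problems(bad))
# ["missing required keys: ['content']", 'content_hash must be 64 characters']
```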
Chunk Format (ChunkDict)
{
    "content": str,                   # Chunk text (required)
    "metadata": {                     # Chunk metadata (required)
        "chunk_index": int,
        "start_offset": int,
        "end_offset": int,
    },
    "chunk_id": str | None,           # Unique ID (optional)
    "embedding": list[float] | None,  # Pre-computed (optional)
}
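To make the offsets concrete, here is a minimal fixed-size splitter that emits dicts in this shape. It is a sketch under assumptions: `chunk_text` is a hypothetical standalone function (the real `chunk()` signature is defined in the Chunking Guide), and offsets are taken to be character positions with an exclusive `end_offset`.

```python
from typing import Any


def chunk_text(text: str, size: int = 200) -> list[dict[str, Any]]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    chunks: list[dict[str, Any]] = []
    for index, start in enumerate(range(0, len(text), size)):
        end = min(start + size, len(text))
        chunks.append(
            {
                "content": text[start:end],
                "metadata": {
                    "chunk_index": index,
                    "start_offset": start,  # assumption: character offset
                    "end_offset": end,      # assumption: exclusive
                },
                "chunk_id": None,
                "embedding": None,
            }
        )
    return chunks


chunks = chunk_text("abcdefghij", size=4)
print([c["content"] for c in chunks])                 # ['abcd', 'efgh', 'ij']
print([c["metadata"]["end_offset"] for c in chunks])  # [4, 8, 10]
```

With these conventions, `text[start_offset:end_offset]` always reproduces the chunk content, which makes the offsets easy to test.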
Rerank Result (RerankResultDict)
{
    "index": int,             # Original document index (required)
    "score": float,           # Relevance score (required)
    "text": str | None,       # Document text (optional)
    "metadata": dict | None,  # Metadata (optional)
}
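The `index` field refers back to the position of the document in the input list, not to its rank. A small sketch of turning raw scores into this shape, assuming results are returned in descending score order; `to_rerank_results` is a hypothetical helper and the actual `rerank()` signature is described in the Reranker Guide.

```python
from typing import Any, Optional


def to_rerank_results(
    scores: list[float],
    documents: list[str],
    top_k: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Sort by score (highest first) while keeping each original index."""
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [
        {"index": i, "score": float(scores[i]), "text": documents[i], "metadata": None}
        for i in order
    ]


results = to_rerank_results([0.1, 0.9, 0.5], ["a", "b", "c"], top_k=2)
print([(r["index"], r["text"]) for r in results])  # [(1, 'b'), (2, 'c')]
```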
Agent Message (AgentMessageDict)
{
    "id": str,               # Unique ID (required)
    "role": str,             # user, assistant, system, tool_call, tool_result, error
    "type": str,             # text, thinking, tool_use, tool_output, partial, final, error
    "content": str,          # Message content (required)
    "timestamp": str,        # ISO 8601 (required)
    "is_partial": bool,      # Streaming partial (optional)
    "sequence_number": int,  # Message order (optional)
}
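A small factory keeps the required fields consistent across an agent's messages. This is a sketch, not Semantik's API: `make_message` is hypothetical, and `MESSAGE_ROLES` and `MESSAGE_TYPES` below are local copies of the values listed in the comments above rather than imports from semantik.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Optional

# Local copies of the documented values (assumption: the lists are complete).
MESSAGE_ROLES = {"user", "assistant", "system", "tool_call", "tool_result", "error"}
MESSAGE_TYPES = {"text", "thinking", "tool_use", "tool_output", "partial", "final", "error"}


def make_message(
    role: str,
    content: str,
    type_: str = "text",
    sequence_number: Optional[int] = None,
) -> dict[str, Any]:
    """Build a message dict with a fresh ID and a UTC ISO 8601 timestamp."""
    if role not in MESSAGE_ROLES:
        raise ValueError(f"Invalid role: {role!r}")
    if type_ not in MESSAGE_TYPES:
        raise ValueError(f"Invalid type: {type_!r}")
    message: dict[str, Any] = {
        "id": str(uuid.uuid4()),
        "role": role,
        "type": type_,
        "content": content,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    if sequence_number is not None:
        message["sequence_number"] = sequence_number
    return message


msg = make_message("assistant", "Hello", sequence_number=1)
print(msg["role"], msg["type"], msg["sequence_number"])  # assistant text 1
```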
Getting Help
- Semantik docs: See semantik/docs/external-plugins.md for protocol details
- Protocol reference: See semantik/docs/plugin-protocols.md for full specifications
- Examples: Check semantik/packages/shared/plugins/builtins/ for built-in plugins