mcp-vector-search Configuration Guide

Comprehensive reference for writing .mcp-vector-search/config.edn configuration files.

Overview

mcp-vector-search indexes documents using semantic embeddings and provides a search tool via the Model Context Protocol. Configuration controls which files are indexed, how they're processed, and what metadata is extracted.

Configuration File Location

Create a configuration file at one of these locations (first found is used):

•Project-specific: .mcp-vector-search/config.edn (in project root)
•Global: ~/.mcp-vector-search/config.edn (user home directory)

The server reads the configuration at startup and indexes all specified sources.

Basic Configuration Structure

clojure

{:description "Custom search tool description"  ; optional
 :watch? true                                   ; optional, enable file watching
 :sources [{:path "/docs/**/*.md"
            :name "Documentation"               ; optional
            :ingest :whole-document             ; optional, defaults to :whole-document
            :watch? true                        ; optional, overrides global :watch?
            :custom-key "custom-value"}]}       ; any additional keys become metadata

Top-level keys:

•:description - Custom description for the search tool (optional)
•:watch? - Enable automatic re-indexing when files change (optional, default: false)
•:sources - Array of source configurations (required)

Path Specifications

Filesystem vs Classpath Sources

Filesystem sources (:path key):

•Use absolute paths with leading /
•Support file watching for automatic re-indexing
•Files are read from the filesystem

clojure

{:path "/docs/**/*.md"}

Classpath sources (:class-path key):

•Use relative paths without leading /
•File watching not available (read-only resources)
•Resources discovered from classpath (JARs, resource directories)
•Useful when embedding mcp-vector-search as a library

clojure

{:class-path "docs/**/*.md"}

Important: Sources must specify exactly one of :path or :class-path.

Glob Patterns

Single-level glob (*):

•Matches any characters within a single directory level
•Does not match path separators

clojure

{:path "/docs/*.md"}           ; matches /docs/README.md
                                ; does NOT match /docs/api/guide.md

Recursive glob (**):

•Matches any characters across multiple directory levels
•Includes path separators

clojure

{:path "/docs/**/*.md"}         ; matches /docs/README.md
                                ; matches /docs/api/guide.md

Named Captures

Extract metadata from file paths using named regex groups:

clojure

{:path "/docs/(?<category>[^/]+)/*.md"}

Syntax: (?<name>pattern)

•name - Metadata key (converted to keyword)
•pattern - Java regular expression

For file /docs/api/functions.md:

•Captures: {:category "api"}

Multiple captures example:

clojure

{:path "/(?<project>[^/]+)/(?<version>v\\d+)/(?<file>.+\\.clj)"}

For file /myapp/v1/core.clj:

•Captures: {:project "myapp", :version "v1", :file "core.clj"}

Path Specification Examples

clojure

;; All markdown files recursively
{:path "/docs/**/*.md"}

;; Single directory level
{:path "/docs/*.md"}

;; Capture directory name
{:path "/docs/(?<category>[^/]+)/*.md"}

;; Multiple captures
{:path "/(?<project>[^/]+)/(?<version>[^/]+)/**/*.clj"}

;; Literal file
{:path "/docs/README.md"}

;; Classpath resource (no leading /)
{:class-path "docs/**/*.md"}

Ingest Pipeline Strategies

Ingest strategies control how documents are processed, embedded, and stored. Set via the :ingest key.

:whole-document (default)

Embeds and stores the entire file content as a single segment.

clojure

{:sources [{:path "/docs/**/*.md"
            :ingest :whole-document}]}

Characteristics:

•One segment per file
•Both embedding and storage use full content
•Simple and straightforward for most use cases

Use when: You want to search across complete documents and return full content.

:namespace-doc

For Clojure source files - embeds only the namespace docstring but stores the full file content.

clojure

{:sources [{:path "/src/**/*.clj"
            :ingest :namespace-doc}]}

Requirements:

•File must contain a valid ns form
•Namespace must have a docstring
•Adds :namespace to metadata (e.g., {:namespace "my.app.core"})

Characteristics:

•Embedding uses only namespace docstring
•Storage includes full file content
•One segment per file

Use when: You want to search Clojure namespaces by their documentation while still returning the complete source code.

:file-path

Embeds the full content but stores only the file path.

clojure

{:sources [{:path "/docs/**/*.md"
            :ingest :file-path}]}

Characteristics:

•Embedding uses full file content
•Storage contains only the file path
•One segment per file
•Reduces memory footprint for large document sets

Use when:

•You only need to discover which files match a query
•You want to reduce memory usage
•Your client will read file content separately

:code-analysis

Analyzes Clojure and Java source files using clj-kondo to extract code elements (vars, namespaces, classes, methods, fields, macros). Creates one searchable segment per code element.

clojure

{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}

Configuration options:

clojure

{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only        ; :all (default) | :public-only
            :element-types #{:var :macro}}]} ; optional filter

:visibility - Controls which elements to include:

•:all (default) - Include all elements regardless of visibility
•
:public-only - Include only public elements
- •Clojure: Excludes vars with ^:private or {:private true} metadata
- •Java: Excludes members with private or protected access modifiers

:element-types (optional) - Set of element types to include:

•Valid types: :var, :macro, :namespace, :class, :method, :field, :constructor
•If omitted: Include all element types
•If provided: Only include specified types

Characteristics:

•Multiple segments per file (one per code element)
•Embedding uses docstring if present, otherwise element name
•Content stores complete clj-kondo analysis map as EDN string
•Supports both Clojure (.clj, .cljs, .cljc) and Java (.java) files

Segment metadata:

•:element-type - Type of code element (var, macro, namespace, class, method, field, constructor)
•:element-name - Qualified name (e.g., "my.ns/my-fn" or "com.example.MyClass.myMethod")
•:language - Source language (clojure or java)
•:namespace - Containing namespace (Clojure) or package (Java)
•:visibility - Access level (public, private, or protected)

Use when: You want to search code by documentation or API discovery, finding functions/classes/methods based on their purpose rather than file names.

Examples:

clojure

;; Search all code elements
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}

;; Search only public API
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only}]}

;; Search only vars and macros
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :element-types #{:var :macro}}]}

;; Java source code analysis
{:sources [{:path "/src/**/*.java"
            :ingest :code-analysis
            :visibility :public-only}]}

:chunked

Splits documents into smaller segments using LangChain4j's recursive text splitter. Enables better semantic search for large documents.

clojure

{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 512
            :chunk-overlap 100}]}

Configuration:

•:chunk-size - Maximum characters per chunk (default: 512)
•:chunk-overlap - Characters to overlap between chunks (default: 100)

Note: LangChain4j's recursive paragraph splitter prioritizes semantic boundaries (paragraph breaks) over exact overlap amounts. Adjacent chunks may have less overlap than specified if splitting at a paragraph boundary.

Characteristics:

•Multiple segments per file
•Each chunk is embedded and stored independently
•All chunks from the same file share the same :doc-id for batch removal during updates
•Chunk metadata includes: :chunk-index (position), :chunk-count (total chunks), :chunk-offset (character offset)

Chunk sizing guidance:

•Smaller chunks (256-512 chars): Better for precise fact-based retrieval
•Larger chunks (1024+ chars): Better for broader context
•Overlap (10-20%): Recommended to preserve context at chunk boundaries

Use when: You have large documents and need precise fact-based retrieval where specific information may be buried in lengthy content.

Examples:

clojure

;; Fine-grained retrieval for technical docs
{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 384
            :chunk-overlap 75}]}

;; Broader context for narrative content
{:sources [{:path "/articles/**/*.md"
            :ingest :chunked
            :chunk-size 1024
            :chunk-overlap 200}]}

;; Compare strategies for different content types
{:sources [
  ;; Small reference docs - whole document works well
  {:path "/api-reference/**/*.md"
   :ingest :whole-document}

  ;; Large guides - chunking improves precision
  {:path "/guides/**/*.md"
   :ingest :chunked
   :chunk-size 512
   :chunk-overlap 100}]}

Metadata System

Metadata comes from two sources:

•Base metadata: Any additional keys in the source map (except :path, :class-path, :name, :ingest, :watch?)
•Captures: Values extracted from named groups in the path spec

clojure

{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "my-project"
            :type "documentation"}]}

For a file /docs/api/functions.md:

•Metadata: {:project "my-project", :type "documentation", :category "api"}

The :name key, if provided, is also added to metadata.

System-added metadata:

•:doc-id - File path (used for watch updates/deletes)
•:file-id - File path
•:segment-id - Unique segment identifier

Strategy-specific metadata:

•:namespace-doc adds: :namespace
•:code-analysis adds: :element-type, :element-name, :language, :namespace, :visibility
•:chunked adds: :chunk-index, :chunk-count, :chunk-offset

File Watching

Optional file watching system for automatic re-indexing when files change.

Configuration:

•Global :watch? true enables watching for all sources
•Per-source :watch? true/false overrides global setting
•Only available for filesystem sources (:path), not classpath sources

clojure

{:watch? true  ; enable globally
 :sources [
   {:path "/docs/**/*.md"}         ; watched (global setting)
   {:path "/src/**/*.clj"
    :watch? false}                  ; not watched (override)
   {:path "/notes/**/*.txt"
    :watch? true}]}                 ; watched (explicit)

Behavior:

•Events are debounced (500ms) to avoid excessive re-indexing
•File created → index new file
•File modified → remove old embeddings by :doc-id, re-index
•File deleted → remove embeddings by :doc-id
•Recursive watching for directories with ** glob

Complete Examples

Basic Documentation Search

clojure

{:sources [{:path "/Users/me/docs/**/*.md"}]}

Multi-Source with Metadata

clojure

{:description "Project documentation and code search"
 :sources [
   {:path "/Users/me/project/docs/**/*.md"
    :name "Documentation"
    :type "docs"}

   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc
    :name "Source Code"
    :type "code"}]}

Metadata Extraction with Captures

clojure

{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "myapp"
            :type "documentation"}]}

Code Analysis with Filtering

clojure

{:sources [
   {:path "/src/**/*.clj"
    :ingest :code-analysis
    :visibility :public-only
    :element-types #{:var :macro}}

   {:path "/src/**/*.java"
    :ingest :code-analysis
    :visibility :public-only}]}

Chunked Large Documents

clojure

{:sources [
   {:path "/guides/**/*.md"
    :ingest :chunked
    :chunk-size 512
    :chunk-overlap 100}]}

Mixed Filesystem and Classpath

clojure

{:sources [
   ;; Filesystem documentation
   {:path "/Users/me/docs/**/*.md"
    :source "local"}

   ;; Bundled library documentation from classpath
   {:class-path "lib-docs/**/*.md"
    :source "library"}

   ;; Clojure source from classpath
   {:class-path "my/app/**/*.clj"
    :ingest :namespace-doc
    :source "library-code"}]}

File Watching Enabled

clojure

{:watch? true
 :sources [
   {:path "/Users/me/project/docs/**/*.md"}
   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc}]}

Complex Multi-Strategy Configuration

clojure

{:description "Comprehensive project search"
 :watch? true
 :sources [
   ;; API reference - small docs, keep whole
   {:path "/docs/api/**/*.md"
    :ingest :whole-document
    :category "api-reference"}

   ;; User guides - large docs, chunk them
   {:path "/docs/guides/**/*.md"
    :ingest :chunked
    :chunk-size 512
    :chunk-overlap 100
    :category "guides"}

   ;; Public API code
   {:path "/src/(?<namespace>[^/]+)/**/*.clj"
    :ingest :code-analysis
    :visibility :public-only
    :category "code"}

   ;; README files - whole document
   {:path "/(?<project>[^/]+)/README.md"
    :ingest :whole-document
    :category "readme"}]}

Tips and Best Practices

Path specifications:

•Use absolute paths for filesystem sources (start with /)
•Use relative paths for classpath sources (no leading /)
•Named captures are powerful for extracting structured metadata
•Test path patterns with a small subset first

Ingest strategies:

•Start with :whole-document for most use cases
•Use :namespace-doc for Clojure codebases to search by documentation
•Use :code-analysis when you need fine-grained API discovery
•Use :chunked for large documents (>1000 chars)
•Use :file-path when you need to minimize memory usage

Metadata:

•Add meaningful metadata to enable filtering during search
•Use consistent naming conventions for metadata keys
•Captures are great for hierarchical organization (project, version, category)

File watching:

•Enable globally with :watch? true for development
•Disable for production or when working with static content
•Override per-source as needed
•Only works with filesystem sources, not classpath

Performance:

•Smaller chunk sizes create more segments (more memory, more precise search)
•Larger chunk sizes create fewer segments (less memory, broader context)
•:file-path strategy significantly reduces memory usage
•Consider the trade-off between search precision and resource usage