AgentSkillsCN

opensearch

通过 MCP 工具高效查询 OpenSearch 日志,同时最大限度减少上下文消耗的指南

SKILL.md
--- frontmatter
name: opensearch
description: Guide for efficiently querying OpenSearch logs via MCP tools with minimal context consumption

OpenSearch MCP Usage Guide

You have access to an OpenSearch MCP server that queries container logs via OpenSearch Dashboards. The cluster has 6.6B+ documents. Always include time ranges to avoid timeouts.

Available Tools

ToolPurpose
opensearch_searchSearch logs with Lucene/KQL query syntax (primary tool)
opensearch_search_rawRaw Query DSL for advanced queries
opensearch_aggregateAggregations (counts, terms, histograms)
opensearch_get_indicesList indices with doc counts
opensearch_get_mappingsGet field names/types from a sample doc
opensearch_cluster_healthBasic cluster health
opensearch_switch_clusterSwitch to a different cluster on-the-fly (no restart needed)
opensearch_get_active_clusterShow currently active cluster name, URL, and cookie age

Context Optimization Strategy (CRITICAL)

The MCP server applies operations to reduce response size. Always optimize for minimal context usage.

Step 1: Start with summary_only to get counts

code
opensearch_search(index="container-logs-*", query_string="...", summary_only=true)

This returns only total_hits and time_range — costs ~100 tokens.

Step 2: If you need actual logs, use field filtering

code
opensearch_search(
  index="container-logs-*",
  query_string="...",
  fields=["@timestamp", "log", "kubernetes.namespace_name", "kubernetes.pod_name"],
  size=10
)

Only returns specified fields — saves 70-80% context vs full documents.

Step 3: For high-volume analysis, use aggregations instead of fetching docs

code
opensearch_aggregate(
  index="container-logs-*",
  aggs={"by_namespace": {"terms": {"field": "kubernetes.namespace_name.keyword", "size": 20}}},
  query={"bool": {"must": [...], "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}]}}
)

Key Parameters for opensearch_search

ParameterDefaultPurpose
summary_onlyfalseSet true to get only hit count, no documents
auto_prunetrueStrips kubernetes.labels and kubernetes.annotations automatically
fieldsnullArray of specific fields to return (e.g., ["log", "@timestamp"])
max_chars_per_hit2000Truncates individual hits exceeding this size
size100Number of docs to return (max 1000)
time_fromnow-15mStart time (ISO 8601 or relative like now-1h)
time_tonowEnd time

Reading _meta Flags

Every response includes a _meta.applied_operations array showing what the server did:

FlagMeaning
summary_onlyOnly counts returned, no documents
field_filter:field1,field2Only these fields were returned
auto_prune:kubernetes.labels,kubernetes.annotationsVerbose k8s fields were removed
hits_truncated:N/MN out of M hits exceeded max_chars_per_hit and were truncated
partial_results:100_of_50000Only 100 of 50000 total hits returned
response_truncated_at_15KBEntire response exceeded 15KB and was cut off

When you see response_truncated_at_15KB:

  1. Reduce size (e.g., size=5)
  2. Use fields to select only needed fields
  3. Use summary_only=true if you only need counts
  4. Use opensearch_aggregate for analysis instead

When you see partial_results:

The query matched more documents than returned. If the user needs broader analysis, use aggregations.

When you see hits_truncated:

Individual log entries were too large. Use fields to pick only the fields you need, or increase max_chars_per_hit.

CRITICAL: Searching the log Field (Cluster-Specific Strategy)

The log field search strategy depends on the cluster type:

OnPrem Clusters (dev-onprem-, stg-onprem-, prod-onprem-*)

For onprem clusters, use query_string with analyze_wildcard: true and quoted wildcard patterns:

Search errors in logs (OnPrem)

json
opensearch_search_raw(
  index="container-logs-*",
  body={
    "query": {"bool": {"must": [
      {"query_string": {
        "query": "log:\"*level*error*\"",
        "analyze_wildcard": true,
        "time_zone": "Asia/Colombo"
      }}
    ], "filter": [
      {"range": {"@timestamp": {"gte": "now-5m", "lte": "now"}}}
    ]}},
    "size": 20,
    "_source": ["@timestamp", "log", "kubernetes.namespace_name", "kubernetes.pod_name"]
  }
)

Search by trace/request ID (OnPrem)

json
opensearch_search_raw(
  index="container-logs-*",
  body={
    "query": {"bool": {"must": [
      {"query_string": {
        "query": "log:\"*1d4867ac-65cb-4de8-8d46-aaef62f6b5fb*\"",
        "analyze_wildcard": true,
        "time_zone": "Asia/Colombo"
      }}
    ], "filter": [
      {"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}}
    ]}},
    "size": 100,
    "sort": [{"@timestamp": "asc"}],
    "_source": ["@timestamp", "log", "kubernetes.namespace_name", "kubernetes.pod_name", "kubernetes.container_name"]
  }
)

Key points for OnPrem:

  • Use query_string with analyze_wildcard: true
  • Wrap the pattern in double quotes: "*pattern*" not *pattern*
  • Include time_zone: "Asia/Colombo" for consistency with dashboard
  • This approach matches what OpenSearch Dashboards UI does

Cloud Clusters (AWS/Azure: dev-aws-, prod-azure-, stg-azure-*)

For cloud clusters (AWS/Azure), the log field is mapped as keyword (not analyzed text). Use wildcard queries:

Search errors in logs (Cloud)

json
opensearch_search_raw(
  index="container-logs-*",
  body={
    "query": {"bool": {"must": [
      {"range": {"@timestamp": {"gte": "now-5m", "lte": "now"}}},
      {"wildcard": {"log": "*level*error*"}}
    ]}},
    "size": 20,
    "_source": ["@timestamp", "log", "kubernetes.namespace_name", "kubernetes.pod_name"]
  }
)

Search by trace/request ID (Cloud)

json
opensearch_search_raw(
  index="container-logs-*",
  body={
    "query": {"bool": {"must": [
      {"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}},
      {"wildcard": {"log": "*77e71a17-2e52-404a-86d2-eed997fd2a57*"}}
    ]}},
    "size": 20,
    "_source": ["@timestamp", "log", "kubernetes.namespace_name", "kubernetes.pod_name"]
  }
)

Key points for Cloud:

  • Use wildcard query (NOT query_string)
  • No quotes needed around the pattern
  • query_string with log:*pattern* returns 0 hits on these clusters

Aggregate error logs by namespace (Works on both)

json
opensearch_aggregate(
  index="container-logs-*",
  query={"bool": {"must": [
    {"range": {"@timestamp": {"gte": "now-5m", "lte": "now"}}},
    {"wildcard": {"log": "*level*error*"}}
  ]}},
  aggs={"namespaces": {"terms": {"field": "kubernetes.namespace_name", "size": 10}}}
)

How to Determine Cluster Type

Check the active cluster name using opensearch_get_active_cluster:

  • If cluster name contains onprem → use query_string with analyze_wildcard: true
  • Otherwise (AWS/Azure) → use wildcard queries

Note: query_string and opensearch_search work fine for non-keyword fields like kubernetes.namespace_name, stream, kubernetes.pod_name, etc. on all clusters.

Common Query Patterns (non-log fields — use opensearch_search)

Search by namespace

code
query_string: 'kubernetes.namespace_name:"my-namespace"'

Search by pod name

code
query_string: 'kubernetes.pod_name:"my-pod-abc123"'

Search stderr logs

code
query_string: 'stream:stderr'

Combine namespace filter with log content search (use opensearch_search_raw)

json
opensearch_search_raw(
  index="container-logs-*",
  body={
    "query": {"bool": {"must": [
      {"range": {"@timestamp": {"gte": "now-5m", "lte": "now"}}},
      {"term": {"kubernetes.namespace_name": "my-namespace"}},
      {"wildcard": {"log": "*timeout*"}}
    ]}},
    "size": 20,
    "_source": ["@timestamp", "log", "kubernetes.pod_name"]
  }
)

Useful Fields for fields Parameter

FieldDescription
@timestampLog timestamp
logThe actual log message
streamstdout or stderr
kubernetes.namespace_nameK8s namespace
kubernetes.pod_nameK8s pod name
kubernetes.container_nameContainer name
kubernetes.hostNode name
kubernetes.pod_ipPod IP address
kubernetes.labels.organization_idOrg ID (when auto_prune=false or use aggs)
kubernetes.labels.env_nameEnvironment name
kubernetes.labels.component_nameComponent name

Time Handling

  • The cluster stores timestamps in UTC
  • The user is in IST (UTC+5:30) — convert accordingly
  • IST 10:45 AM = UTC 05:15 AM
  • Use relative times when possible: now-5m, now-1h, now-24h

Cluster Map

IMPORTANT: When the user mentions a cluster, first read opensearch-mcp/clusters.py to get the available clusters and their short names. The cluster registry is defined in the CLUSTERS dictionary in that file.

Example Cluster Format

Users configure their clusters in opensearch-mcp/clusters.py. Example entries:

python
CLUSTERS = {
    "dev-aws-eu-cluster": ("https://opensearch-dashboard.dev.example.com", "Dev AWS EU Cluster"),
    "prod-aws-cluster": ("https://opensearch-dashboard.prod.example.com", "Prod AWS Cluster"),
}

Common aliases

When the user says any of these, map to the corresponding cluster:

  • "prod" / "production" → prod-azure-us-cdp (the main production cluster)
  • "dev" / "development" → dev-azure-us-cdp
  • "stg" / "staging" → stg-azure-us-cdp
  • "dev onprem" → dev-onprem-cp or dev-onprem-dp (ask which)
  • "prod eu" → prod-azure-eu-cdp
  • "dev eu" → dev-azure-eu-cdp
  • "wdt" → prod-wdt-userprod
  • "cutndry" / "cnd" → prod-cutndry
  • "bcentral" → prod-bcentral
  • "devry" → prod-devry

If the user asks to query a cluster that has No OpenSearch, inform them it uses Azure Log Analytics Workspace instead and is not queryable through this MCP.

Switching Clusters

When the user wants to query a different cluster, use the opensearch_switch_cluster tool:

code
opensearch_switch_cluster(cluster="prod-azure-eu-cdp")

This automatically fetches cookies via headless SSO and switches all subsequent queries to the new cluster. No restart needed.

If the tool returns an error (SSO session expired), instruct the user to run:

code
cd /Users/wso2/Desktop/WSO2/RnD/opensearch-agent/opensearch-mcp
./get-cookies.py <cluster-short-name>

This opens a browser for manual login. After login, retry — no restart needed.

Use opensearch_get_active_cluster to check which cluster is currently active before switching.

Cookie Management and 401 Handling

The MCP server has automatic cookie refresh. Here's how it works:

Auto-refresh (transparent to you)

  1. When a request gets 401, the server automatically launches a headless Playwright browser
  2. It uses the cached Azure AD SSO session to get fresh cookies
  3. Saves them to cookies.json and retries the request
  4. You (Claude) never see the 401 — it's handled internally

Cluster switching also refreshes cookies

When you call opensearch_switch_cluster, it fetches fresh cookies for the target cluster via headless SSO. If that fails, the tool returns an error with manual instructions.

When auto-refresh fails

If the SSO session itself has expired (user hasn't logged in via browser recently), auto-refresh fails. The server returns a structured error with action_required and a command to run.

When you see this error, instruct the user:

code
The OpenSearch cookies have expired and automatic refresh failed (SSO session expired).
To fix, run:

  cd /Users/wso2/Desktop/WSO2/RnD/opensearch-agent/opensearch-mcp
  ./get-cookies.py <cluster-name>

This opens a browser for you to log in. After login completes, cookies are
saved automatically. **No Claude Code restart needed** — just retry the query.

Available clusters: ./get-cookies.py --list

Important: After the user runs the script, you CAN retry the query immediately — no restart needed. The server reads cookies.json fresh on every request.

Cost-Conscious Query Plan

For any user request, follow this order:

  1. Count firstsummary_only=true to understand volume
  2. Sample if largesize=5, fields=[...] to understand shape
  3. Aggregate if analytical — use opensearch_aggregate for breakdowns
  4. Full fetch only when needed — small result sets with field filtering