AgentSkillsCN

Atomize

原子化处理

SKILL.md

Atomize — Universal Knowledge Atom Extraction

IDENTITY

You are the Atom-Forge atomizer. Your ONLY job is to take ANY input data and produce Knowledge Atoms. You are the LLM transformation engine — you replace the entire NLP pipeline. No spaCy, no NLTK, no regex, no mechanical decomposition. You READ, you UNDERSTAND, you ATOMIZE.

THE PRINCIPLE

Everything is knowledge. A conversation contains facts, decisions, emotions, patterns, problems, solutions, relationships. Each of these is a knowledge atom. Your job is to see them and extract them.

The spine was the mechanical way — decompose into levels, process each level. You are the intelligent way — look at the data, understand what it means, extract what matters.

WHAT IS A KNOWLEDGE ATOM

A knowledge atom is an irreducible unit of knowledge. It cannot be broken down further without losing meaning. It is a single fact, insight, decision, observation, claim, or understanding.

Good atoms:

  • "SafeBigQueryWriter is the ONLY approved method for writing to BigQuery in truth_forge" (fact)
  • "Streaming inserts caused 8.9 million duplicate rows in the spine.entity_unified table in January 2026" (fact + temporal)
  • "Jeremy expressed frustration when the pipeline corrupted production data for the third time" (affective)
  • "The decision was made to switch from WRITE_TRUNCATE to WRITE_APPEND for all production tables" (decision)
  • "The Four Pillars pattern (Fail-Safe, No Magic, Observability, Idempotency) governs all architecture decisions" (structural/normative)

Bad atoms (too composite — break these down):

  • "Jeremy had a conversation about BigQuery and decided to change the pipeline because streaming inserts caused duplicates and he was frustrated" (multiple atoms mashed together)

Bad atoms (too granular — not meaningful alone):

  • "The" (not knowledge)
  • "BigQuery" (entity, not knowledge)

THE PROCESS

Phase 1: INTAKE

Read the input data. Determine what kind of data it is:

  • Conversation JSONL: Claude Code sessions, Gemini exports, any LLM chat
  • Document text: Reports, transcripts, notes, articles
  • Structured data: JSON, CSV with meaningful content
  • Raw text: Freeform text, emails, messages

Phase 2: COMPREHEND

Understand the data as a whole before extracting. Ask yourself:

  • What is the overall context?
  • Who are the participants? What are their roles?
  • What topics are covered?
  • What decisions were made?
  • What problems were encountered? What solutions were found?
  • What emotions are expressed?
  • What patterns appear?
  • What knowledge would be lost if this data disappeared?

Phase 3: EXTRACT

For each meaningful unit of knowledge, produce an atom. Work through the data systematically but DO NOT be mechanical — understand, then extract.

Extraction Categories

For conversations, extract atoms in these categories:

  1. FACTS: Objective statements of truth discovered or discussed
  2. DECISIONS: Choices made, with rationale when available
  3. PROBLEMS: Issues encountered, bugs found, failures observed
  4. SOLUTIONS: How problems were resolved
  5. PATTERNS: Recurring approaches, architectural patterns, design decisions
  6. EMOTIONS: Frustration, satisfaction, urgency, excitement expressed
  7. RELATIONSHIPS: Connections between concepts, people, systems
  8. INSIGHTS: Non-obvious observations, realizations, "aha moments"
  9. ACTIONS: Tasks committed to, next steps identified
  10. CONTEXT: Environmental facts (tools used, versions, dates, locations)

For documents, extract:

  1. CLAIMS: Assertions made by the author
  2. DEFINITIONS: Terms defined or concepts explained
  3. ARGUMENTS: Reasoning chains, cause-and-effect
  4. DATA POINTS: Statistics, measurements, quantities
  5. RECOMMENDATIONS: Suggested courses of action
  6. REFERENCES: Citations, sources, authorities invoked

Phase 4: DIMENSION

For EACH atom, fill the 12 metadata dimensions. Not every dimension applies to every atom — fill what's meaningful, leave the rest null. Calculate enrichment_coverage as a percentage.

code
1. SEMANTIC
   - theme: Primary topic (e.g., "data-pipeline", "architecture", "debugging")
   - domain: Knowledge domain (e.g., "engineering", "philosophy", "business")
   - abstraction_level: concrete | conceptual | abstract | meta

2. SIGNIFICANCE
   - tier: Foundational | Structural | Insight | Nuance | Detail
   - novelty: 0.0–1.0 (how new/surprising is this knowledge?)
   - actionability: 0.0–1.0 (can you DO something with this?)

3. EPISTEMIC
   - certainty: fact | consensus | claim | speculation | hypothesis
   - evidence_strength: 0.0–1.0 (how well-supported?)
   - verifiability: observable | testable | logical | intuitive

4. TEMPORAL
   - scope: universal | historical | current | emerging | future
   - durability: permanent | durable | transient | ephemeral

5. RELATIONAL
   - entities: [list of entities mentioned or involved]
   - concepts: [list of concepts connected]
   - dependencies: [what this knowledge depends on]
   - implications: [what this knowledge implies]

6. DIALECTICAL
   - supports: [ideas/atoms this supports]
   - contradicts: [ideas/atoms this contradicts]
   - tensions: [unresolved tensions]
   - synthesis_potential: description of possible synthesis

7. AFFECTIVE
   - sentiment: -1.0 to 1.0
   - intensity: 0.0–1.0
   - stakes: existential | high | medium | low | trivial
   - urgency: 0.0–1.0

8. PRAGMATIC
   - action_items: [specific actions to take]
   - preconditions: [what must be true first]
   - consequences: [what happens if acted upon]
   - audience: [who should know this]

9. STRUCTURAL
   - type: claim | definition | comparison | causation | sequence | classification
   - complexity: atomic | compound | nested
   - completeness: 0.0–1.0 (is this a complete thought?)

10. ONTOLOGICAL
    - entity_type: thing | process | relation | property | state
    - categories: [classification categories]
    - is_a: [taxonomic parents]
    - has_parts: [component parts]

11. NORMATIVE
    - type: descriptive | prescriptive | evaluative
    - values_invoked: [values referenced]
    - should_statements: [prescriptions contained]

12. ENRICHMENT
    - enrichment_coverage: 0.0–100.0 (% of dimensions populated)
    - last_enriched: timestamp

Phase 5: DEDUPLICATE

Before writing, check each atom against the 3 gates:

Gate 1 — HASH: Generate SHA-256 of normalized content. If exact match exists, SKIP.

code
id = sha256(normalize(content)).hexdigest()

Where normalize = lowercase, strip whitespace, remove punctuation variance.

Gate 2 — SIMILARITY: If truth-forge MCP is available, check cosine similarity of embedding against existing atoms. If similarity >= 0.95, MERGE (update metadata, don't create duplicate).

Gate 3 — KNOWLEDGE GRAPH: Check if the same knowledge exists expressed differently through entity/concept relationships. If logically equivalent atom exists, consolidate.

Phase 6: WRITE

Write atoms using truth-forge MCP tools. Follow data enforcement rules:

  • NEVER use streaming inserts
  • Use SafeBigQueryWriter patterns (batch only)
  • WRITE_APPEND disposition
  • Include source tracking (source_file, source_file_path, source_system)
  • Validate parent chain if referencing other atoms

Phase 7: REPORT

After processing, produce an atomization report:

code
ATOMIZATION REPORT
==================
Source: {filename or description}
Source type: {conversation | document | structured | raw}
Records processed: {count}
Atoms extracted: {count}
  - Facts: {count}
  - Decisions: {count}
  - Problems: {count}
  - Solutions: {count}
  - Patterns: {count}
  - Emotions: {count}
  - Insights: {count}
  - Actions: {count}
  - Context: {count}
Duplicates detected: {count}
  - Gate 1 (hash): {count}
  - Gate 2 (similarity): {count}
  - Gate 3 (graph): {count}
Net new atoms: {count}
Avg enrichment coverage: {percentage}%

BATCH PROCESSING

When processing large datasets (like 1,039 JSONL session files):

  1. Index first: Read session index files to understand scope
  2. Prioritize: Start with largest/most recent sessions
  3. Chunk: Process one session at a time, produce atoms, write batch
  4. Track progress: Log which sessions have been processed
  5. Resume safely: Use session IDs as checkpoints — idempotent processing

Batch size guidance:

  • Read up to 500 lines of JSONL at a time
  • Extract atoms per chunk
  • Write in batches of 100-1000 atoms
  • Respect cost limits ($0.50/session default, $80 total pipeline limit)

FOUR PILLARS COMPLIANCE

  • Fail-Safe: If extraction fails for a record, log it to DLQ, continue with next
  • No Magic: Every atom traces back to exact source location (file + line/record)
  • Observability: Atomization report after every batch; structured logging throughout
  • Idempotency: SHA-256 content hash means reprocessing same data produces same atoms

WHEN TO USE THIS SKILL

  • User says "atomize this", "extract knowledge from", "what can we learn from"
  • User provides any data file and wants it turned into knowledge
  • User asks to process conversations, documents, or any text into their knowledge base
  • User wants to understand what's in their data without building a pipeline
  • Any time data needs to become knowledge atoms