
libingest

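libingest: document ingestion pipeline. IngestPipeline orchestrates configurable transformation steps. IngestStep defines individual processors such as pdf-to-images, images-to-html, extract-context, and annotate-html. It converts PDF, PowerPoint, and image files to Schema.org-annotated HTML. Suited to document processing, knowledge extraction, and content transformation.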

SKILL.md
---
name: libingest
description: >
  libingest - Document ingestion pipeline. IngestPipeline orchestrates
  configurable transformation steps. IngestStep defines individual processors
  like pdf-to-images, images-to-html, extract-context, annotate-html. Converts
  PDF, PowerPoint, images to Schema.org annotated HTML. Use for document
  processing, knowledge extraction, and content transformation.
---

libingest Skill

When to Use

  • Converting PDF documents to structured HTML
  • Processing PowerPoint presentations for indexing
  • Extracting semantic content from images via OCR
  • Building document ingestion pipelines

Key Concepts

IngestPipeline: Orchestrates a sequence of transformation steps defined in config/ingest.yml.

IngestStep: Individual processing step (pdf-to-images, images-to-html, extract-context, annotate-html, normalize-html).
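
To make the relationship between the two concepts concrete, here is a minimal sketch of sequential step orchestration. It is not libingest's implementation: `runSteps`, the `{ name, run }` step shape, and the stub step bodies are hypothetical, chosen only to illustrate how each step's output can feed the next.

```javascript
// Hypothetical sketch: NOT the libingest API. Each step is assumed to be
// an object with a `name` and an async `run(input)` function.
async function runSteps(steps, input) {
  let current = input;
  const trace = [];
  for (const step of steps) {
    // The output of one step becomes the input of the next
    current = await step.run(current);
    trace.push(step.name);
  }
  return { output: current, trace };
}

// Stub steps named after the ones listed above; the bodies are placeholders.
const exampleSteps = [
  { name: "pdf-to-images", run: async (file) => [`${file}.page-1.png`] },
  {
    name: "images-to-html",
    run: async (images) => `<div data-pages="${images.length}"></div>`,
  },
  { name: "normalize-html", run: async (html) => html.trim() },
];
```

Calling `runSteps(exampleSteps, "document.pdf")` resolves to an object holding the final value and the ordered list of step names that ran. The real pipeline reads its step list from config/ingest.yml rather than from an in-code array.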

Usage Patterns

Pattern 1: Run ingestion via CLI

```bash
# Drop files in data/ingest/in/
cp document.pdf data/ingest/in/

# Run the pipeline
make ingest
```

Pattern 2: Programmatic ingestion

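```javascript
import { IngestPipeline } from "@copilot-ld/libingest";

// config, storage, and llmClient must be created by the caller beforehand
const pipeline = new IngestPipeline(config, storage, llmClient);

// Run the configured steps against the input file
const result = await pipeline.process("document.pdf");
// result.output points to final HTML
```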
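
When several documents need processing, the same call can be wrapped in a loop. The sketch below is an illustration under assumptions, not part of libingest: `ingestAll` and `stubPipeline` are hypothetical names, and the only behavior taken from the snippet above is that `pipeline.process(name)` resolves to an object with an `output` field.

```javascript
// Hypothetical helper: processes files one after another and collects each
// result's `output`. Assumes `pipeline.process(name)` resolves to an object
// with an `output` field, as in the programmatic example.
async function ingestAll(pipeline, fileNames) {
  const outputs = {};
  const failures = {};
  for (const name of fileNames) {
    try {
      const result = await pipeline.process(name);
      outputs[name] = result.output;
    } catch (err) {
      // Record the failure and continue with the remaining files
      failures[name] = err.message;
    }
  }
  return { outputs, failures };
}

// Stand-in for a real IngestPipeline so the sketch runs on its own.
const stubPipeline = {
  async process(name) {
    if (!name.endsWith(".pdf")) throw new Error(`unsupported: ${name}`);
    return { output: name.replace(/\.pdf$/, ".html") };
  },
};
```

With a real pipeline, `stubPipeline` would be replaced by the `pipeline` instance constructed earlier. The source does not say whether `process` is safe to call concurrently, so the sketch runs files sequentially.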

Integration

The pipeline is configured via config/ingest.yml and uses libllm for vision processing. Output is stored in data/ingest/pipeline/.