using-spacy-nlp

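Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting an NLP model" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying NLP models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.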

SKILL.md
--- frontmatter
name: using-spacy-nlp
description: Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.

spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.

Scope

In Scope:

  • spaCy 3.x installation and text processing
  • TextCategorizer training for document classification
  • Production deployment and optimization patterns

Out of Scope (use other tools/skills):

  • Training custom NER models (different workflow)
  • spaCy 2.x (deprecated, incompatible with 3.x)
  • Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
  • Custom tokenizers or language models

Quick Start

python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)

Installation

Standard Setup

bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

Model Selection

| Model | Size | Speed | Use Case |
| --- | --- | --- | --- |
| en_core_web_sm | 12 MB | Fastest | Prototyping, speed-critical |
| en_core_web_md | 40 MB | Fast | General use with word vectors |
| en_core_web_lg | 560 MB | Fast | Semantic similarity tasks |
| en_core_web_trf | 438 MB | Slow | Maximum accuracy (GPU) |

Verify Installation

python
import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")

For detailed installation options (conda, GPU, transformers): See references/installation.md


Text Processing

Basic Pipeline

python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")

Named Entity Recognition

python
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple Inc." ORG, "Steve Jobs" PERSON

For entity types, filtering, and span details: See references/basic-usage.md

Batch Processing (Critical for Production)

python
# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
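
When each text carries an ID or other metadata, `nlp.pipe` can pass it through alongside the document via `as_tuples=True`. The sketch below shows one way to keep record IDs attached to their results. The function name `annotate_records` and the `id`/`text` field names are assumptions for illustration, not part of spaCy or of this skill's scripts.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def annotate_records(
    nlp: Any,
    records: Iterable[Dict[str, str]],
    batch_size: int = 50,
) -> Iterator[Tuple[str, List[Tuple[str, str]]]]:
    """Yield (record_id, [(entity_text, entity_label), ...]) per record.

    `nlp` is any loaded spaCy pipeline. The "id" and "text" keys are
    assumed field names, not something spaCy requires.
    """
    # as_tuples=True makes nlp.pipe accept (text, context) pairs and
    # yield (doc, context) pairs, so each ID stays with its document.
    pairs = ((rec["text"], rec["id"]) for rec in records)
    for doc, rec_id in nlp.pipe(pairs, as_tuples=True, batch_size=batch_size):
        yield rec_id, [(ent.text, ent.label_) for ent in doc.ents]
```

Typical use would be `for rec_id, ents in annotate_records(spacy.load("en_core_web_sm"), records): ...`, which keeps the batching benefit of `nlp.pipe` without losing track of which result belongs to which input.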

Disable Unused Components

python
# Only need NER: disable the other components (roughly 2x faster; measure on your own data)
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])

For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md


Training Classifiers

Train custom text classifiers with TextCategorizer.

Workflow Overview

  1. Prepare data → Run scripts/prepare_training_data.py
  2. Generate config → Run scripts/generate_config.py or use assets/config_textcat.cfg
  3. Validate → python -m spacy debug data config.cfg (catches issues before training)
  4. Train → python -m spacy train config.cfg --output ./output
  5. Evaluate → Run scripts/evaluate_model.py
  6. Use → nlp = spacy.load("./output/model-best")
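
The evaluation step delegates to scripts/evaluate_model.py, whose contents are not shown here. As a rough sketch of the kind of per-label metrics such a step might report, the function below computes precision, recall, and F1 from parallel lists of gold and predicted labels. The name `per_label_metrics` is hypothetical and this is not the script's actual implementation. spaCy's own `python -m spacy evaluate` command is another option for aggregate scores.

```python
from collections import Counter
from typing import Dict, List


def per_label_metrics(gold: List[str], predicted: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for each label (single-label case)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    metrics: Dict[str, Dict[str, float]] = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

The predicted labels would come from taking the highest-scoring key of `doc.cats` for each dev-set text.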

Data Format

Training data uses spaCy's DocBin format. Example input (JSON):

json
[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]

Convert with script:

bash
python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
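
The conversion script itself is not reproduced in this document. The following is a minimal sketch of what a JSON-to-DocBin conversion could look like, assuming mutually exclusive labels and the `text`/`label` fields from the example above. The helper names (`split_records`, `make_cats`, `to_docbin`) are illustrative and may differ from what scripts/prepare_training_data.py actually does.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def split_records(
    records: Sequence[Record], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Shuffle deterministically, then split into (train, dev)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def make_cats(label: str, labels: Sequence[str]) -> Dict[str, float]:
    """Build the one-hot dict stored in doc.cats for exclusive classes."""
    return {name: 1.0 if name == label else 0.0 for name in labels}


def to_docbin(records: Sequence[Record], labels: Sequence[str]):
    """Tokenize each record and collect the annotated docs in a DocBin."""
    # Imported lazily so the pure helpers above work without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # tokenizer only; no trained components needed
    doc_bin = DocBin()
    for rec in records:
        doc = nlp.make_doc(rec["text"])
        doc.cats = make_cats(rec["label"], labels)
        doc_bin.add(doc)
    return doc_bin
```

With these helpers, `to_docbin(train, labels).to_disk("train.spacy")` and the same call for the dev split would produce the two files the training config points at. Fixing the shuffle seed keeps the split reproducible between runs.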

Training Command

bash
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0

Using Trained Model

python
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # e.g. DevOps: 94.2% (depends on your trained model)
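
Taking the maximum score always returns some label, even when the model is unsure. A small sketch of a confidence cutoff is shown below. The function name `pick_label` and the default threshold of 0.5 are assumptions to tune for your own data, not values recommended by spaCy.

```python
from typing import Dict, Optional, Tuple


def pick_label(cats: Dict[str, float], threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (label, score), or (None, score) if the top score is below threshold."""
    if not cats:
        return None, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    return (label, score) if score >= threshold else (None, score)
```

Calling `pick_label(doc.cats, threshold=0.7)` lets the caller route low-confidence texts to a fallback such as manual review instead of acting on a weak prediction.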

For detailed training guide: See references/text-classification.md


Troubleshooting

Model Not Found (E050)

code
OSError: [E050] Can't find model 'en_core_web_sm'

Fix:

bash
python -m spacy download en_core_web_sm

Alternative (avoids path issues):

python
import en_core_web_sm
nlp = en_core_web_sm.load()

Memory Issues

Symptoms: OOM errors, slow processing

Fixes:

python
# 1. Exclude unused components (exclude= skips loading them entirely)
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks (chunk_text is a user-defined helper, not a spaCy function)
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
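
The `chunk_text` helper used in fix 2 is not defined in this document and is not part of spaCy. One possible sketch is given below, assuming that splitting on whitespace is acceptable. It does not respect sentence boundaries, so entities or sentences that straddle a cut may be affected.

```python
from typing import Iterator


def chunk_text(text: str, max_length: int = 100_000) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `max_length` characters.

    Breaks after the last whitespace inside each window so words stay
    intact, and falls back to a hard cut when the window has none.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")

    start, n = 0, len(text)
    while start < n:
        end = min(start + max_length, n)
        if end < n:
            # Find the last space or newline inside the current window.
            split = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split > start:
                end = split + 1  # keep the whitespace with the left chunk
        yield text[start:end]
        start = end
```

Because the whitespace stays attached to the preceding chunk, joining the chunks reproduces the original text exactly, which makes it possible to map character offsets back if needed.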

GPU Not Working

python
import spacy

# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Runs on GPU only if prefer_gpu() returned True

Version Compatibility

spaCy 2.x models do not work with spaCy 3.x. Check compatibility:

bash
python -m spacy validate

For more troubleshooting: See references/troubleshooting.md


Production Deployment

Package Model

bash
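mkdir -p ./packages  # spacy package expects the output directory to exist
python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

# The "en_" prefix comes from the pipeline's language code
pip install ./packages/en_my_classifier-1.0.0/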

FastAPI Server

Use the production template:

bash
python scripts/serve_model.py --model ./output/model-best --port 8000

Or customize from template:

python
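from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")  # load once at startup, not per request

class ClassifyRequest(BaseModel):
    text: str  # sent as a JSON body: {"text": "..."}

@app.post("/classify")
async def classify(request: ClassifyRequest):
    # memory_zone (spaCy 3.8+) releases per-request caches when the block exits
    with nlp.memory_zone():
        doc = nlp(request.text)
        # Copy the scores out; the Doc should not be used after the zone closes
        scores = dict(doc.cats)
    return {"category": max(scores, key=scores.get), "scores": scores}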

Performance Optimization

| Technique | Speedup | When to Use |
| --- | --- | --- |
| Disable components | 2-3x | Don't need all annotations |
| nlp.pipe() | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
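
The speedups above are rough figures and will vary with model, hardware, and text length, so it is worth measuring on your own data. The helper below is a minimal timing sketch. The name `docs_per_second` and its parameters are assumptions for illustration. It accepts any callable that takes the texts and returns an iterable of results.

```python
import time
from typing import Callable, Iterable, Sequence


def docs_per_second(
    run: Callable[[Sequence[str]], Iterable],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return best-of-N throughput for `run`, which processes all texts."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        # Exhaust the result so lazy generators such as nlp.pipe do the work.
        for _doc in run(texts):
            pass
        best = min(best, time.perf_counter() - started)
    return len(texts) / best if best > 0 else float("inf")
```

For example, comparing `docs_per_second(lambda ts: (nlp(t) for t in ts), texts)` with `docs_per_second(lambda ts: nlp.pipe(ts, batch_size=50), texts)` shows how much batching helps for a given pipeline.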

For evaluation metrics and hyperparameter tuning: See references/production.md


Scripts Reference

| Script | Purpose | Usage |
| --- | --- | --- |
| prepare_training_data.py | Convert JSON to DocBin | python scripts/prepare_training_data.py --input data.json |
| generate_config.py | Create training config | python scripts/generate_config.py --categories "A,B,C" |
| evaluate_model.py | Detailed metrics | python scripts/evaluate_model.py --model ./output/model-best |
| serve_model.py | FastAPI server | python scripts/serve_model.py --model ./model --port 8000 |

Assets Reference

| Asset | Purpose | Usage |
| --- | --- | --- |
| config_textcat.cfg | Base training config | Copy and customize for your labels |
| training_data_template.json | Data format example | Reference for preparing your data |