Clio-Style Clustering Pipeline
Build an end-to-end semantic clustering analysis from any text data source, with interactive visualization.
What This Skill Does
This skill guides you through building a complete clustering pipeline:
- Data Sourcing - Identify APIs/methods to fetch data, build tests to verify access
- Scraping - Collect data with proper pagination and rate limiting
- Embedding - Generate embeddings using OpenAI's text-embedding-3-large
- Clustering - Hierarchical HDBSCAN clustering with UMAP projection
- Labeling - LLM-powered cluster naming and description
- Visualization - Interactive React/D3 explorer with drill-down
Quick Start
When the user describes a data source (e.g., "GitHub issues from facebook/react"), follow these steps:
Phase 1: Data Source Discovery
First, identify how to access the data:
- Research the API - Use web search to find official API documentation
- Identify authentication - What tokens/keys are needed?
- Find pagination patterns - How does the API handle large datasets?
- Determine rate limits - What are the constraints?
See data-sourcing.md for common patterns (GitHub, Slack, etc.)
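As an illustration of the rate-limit step, here is a minimal sketch that decides how long to wait before the next request. It assumes GitHub-style response headers (`X-RateLimit-Remaining` and `X-RateLimit-Reset`, the latter in UTC epoch seconds); other APIs use different header names, and the function name `seconds_until_reset` is hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how many seconds to sleep before sending the next request.

    Assumes GitHub-style headers: X-RateLimit-Remaining (requests left in the
    current window) and X-RateLimit-Reset (window reset time, UTC epoch seconds).
    """
    now = time.time() if now is None else now
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0  # Budget left in this window: no need to wait.
    reset_at = float(lowered.get("x-ratelimit-reset", now))
    # Never return a negative wait if the reset time is already in the past.
    return max(0.0, reset_at - now)
```

A scraper could call `time.sleep(seconds_until_reset(response.headers))` after each response.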
Phase 2: Build & Test Data Fetcher
IMPORTANT: Write tests BEFORE building the full scraper.
# test_fetcher.py - Verify API access works
import os

import requests


def test_api_access():
    """Verify we can access the API."""
    # Adapt this for your specific data source
    token = os.environ.get('API_TOKEN')
    assert token, "API_TOKEN not set"

    response = requests.get(
        'https://api.example.com/endpoint',
        headers={'Authorization': f'Bearer {token}'},
        timeout=30,  # Fail fast instead of hanging on a dead connection
    )
    assert response.status_code == 200, f"Unexpected status: {response.status_code}"

    data = response.json()
    assert len(data) > 0, "No data returned"
    print(f"Successfully fetched {len(data)} items")


if __name__ == '__main__':
    test_api_access()
Run the test: python test_fetcher.py
Only proceed to the full scraper once tests pass.
Phase 3: Build the Scraper
Create a scraper that:
- Handles pagination efficiently
- Respects rate limits
- Stores fetched items in SQLite
- Saves progress (for example, the last page or cursor) so an interrupted run can resume
See data-sourcing.md for the database schema and scraper template.
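The requirements above can be sketched as a small cursor-based loop. This is a minimal illustration, not the template from data-sourcing.md: the `items` and `progress` tables, the `scrape` function, and the `fetch_page` callback (which takes a cursor and returns `(items, next_cursor)`) are all hypothetical names, and a fixed delay stands in for real rate-limit handling.

```python
import sqlite3
import time
from typing import Callable, Dict, List, Optional, Tuple

# A page fetcher takes a cursor (None for the first page) and returns
# (items, next_cursor); next_cursor is None when there are no more pages.
FetchPage = Callable[[Optional[str]], Tuple[List[Dict], Optional[str]]]


def init_db(conn: sqlite3.Connection) -> None:
    """Create the hypothetical tables used by this sketch."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, text TEXT NOT NULL)")
    conn.execute("CREATE TABLE IF NOT EXISTS progress (key TEXT PRIMARY KEY, value TEXT)")
    conn.commit()


def scrape(conn: sqlite3.Connection, fetch_page: FetchPage, delay_seconds: float = 1.0) -> int:
    """Fetch all remaining pages and return the number of items fetched in this run."""
    init_db(conn)
    state = dict(conn.execute("SELECT key, value FROM progress").fetchall())
    if state.get("done") == "1":
        return 0  # A previous run already reached the last page.
    cursor = state.get("cursor")  # None means start from the first page.
    fetched = 0
    while True:
        items, next_cursor = fetch_page(cursor)
        with conn:  # One transaction per page: items and progress commit together.
            conn.executemany(
                "INSERT OR IGNORE INTO items (id, text) VALUES (?, ?)",
                [(str(item["id"]), item["text"]) for item in items],
            )
            if next_cursor is None:
                conn.execute("INSERT OR REPLACE INTO progress VALUES ('done', '1')")
            else:
                conn.execute(
                    "INSERT OR REPLACE INTO progress VALUES ('cursor', ?)", (next_cursor,)
                )
        fetched += len(items)
        if next_cursor is None:
            return fetched
        cursor = next_cursor
        time.sleep(delay_seconds)  # Crude fixed delay between requests.
```

Because the cursor is committed in the same transaction as the page's items, a crash between pages loses at most the page that was in flight.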
Phase 4: Generate Embeddings & Cluster
Use the clustering pipeline to:
- Generate embeddings with OpenAI
- Run hierarchical HDBSCAN clustering
- Project to 2D with UMAP
- Label clusters with LLM
See clustering-reference.md for the complete implementation.
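For the embedding step, a minimal sketch of batching is shown here; it is not the implementation from clustering-reference.md. The helper names (`batched`, `embed_all`, `openai_embed_fn`) and the batch size of 100 are assumptions. The OpenAI-backed function uses `client.embeddings.create` from the `openai>=1.0` SDK and is imported lazily so the batching logic can be exercised without the SDK installed.

```python
from typing import Callable, Iterator, List, Sequence

# An embed function maps a batch of texts to one vector per text, in order.
EmbedFn = Callable[[List[str]], List[List[float]]]


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def embed_all(texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 100) -> List[List[float]]:
    """Embed every text in batches, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        result = embed_fn(batch)
        if len(result) != len(batch):
            raise ValueError(f"expected {len(batch)} vectors, got {len(result)}")
        vectors.extend(result)
    return vectors


def openai_embed_fn(model: str = "text-embedding-3-large") -> EmbedFn:
    """Build an embed function backed by the OpenAI SDK (reads OPENAI_API_KEY)."""
    from openai import OpenAI  # Lazy import: only needed for real API calls.

    client = OpenAI()

    def embed(batch: List[str]) -> List[List[float]]:
        response = client.embeddings.create(model=model, input=batch)
        return [item.embedding for item in response.data]

    return embed
```

With an API key set, `embed_all(texts, openai_embed_fn())` would return one vector per input text.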
Phase 5: Build Visualization
Set up the interactive visualization:
- Export data to JSON
- Create Next.js app with D3 visualization
- Add hierarchical drill-down view
See visualization-setup.md for setup instructions.
The components/ directory contains ready-to-copy React components.
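To make the export step concrete, here is a minimal sketch of turning per-item cluster labels into a nested structure suitable for a drill-down view. The field names (`l1`, `l2`, `l3`, `item_ids`, `children`) and the choice to skip any item labeled `-1` (HDBSCAN's noise label) at any level are assumptions for illustration; the actual schema is defined in the reference docs.

```python
import json
from typing import Dict, List


def build_cluster_tree(items: List[Dict]) -> List[Dict]:
    """Nest item ids by their (l1, l2, l3) cluster labels.

    Items with a -1 label at any level are treated as noise and skipped.
    """
    tree: Dict[int, Dict[int, Dict[int, List[str]]]] = {}
    for item in items:
        l1, l2, l3 = item["l1"], item["l2"], item["l3"]
        if -1 in (l1, l2, l3):
            continue
        tree.setdefault(l1, {}).setdefault(l2, {}).setdefault(l3, []).append(item["id"])
    # Convert nested dicts to sorted lists so the JSON output is deterministic.
    return [
        {
            "id": l1,
            "children": [
                {
                    "id": l2,
                    "children": [
                        {"id": l3, "item_ids": ids} for l3, ids in sorted(l3s.items())
                    ],
                }
                for l2, l3s in sorted(l2s.items())
            ],
        }
        for l1, l2s in sorted(tree.items())
    ]


def write_clusters(items: List[Dict], path: str) -> None:
    """Write the nested cluster tree to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_cluster_tree(items), handle, indent=2)
```

For example, `write_clusters(items, "visualizer/public/data/clusters.json")` would produce the file the visualizer reads, assuming that layout.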
Project Structure
When complete, the project should look like:
project/
├── data/
│   └── items.db              # SQLite database
├── pipeline/
│   ├── __init__.py
│   ├── db.py                 # Database operations
│   ├── scraper.py            # Data fetcher
│   ├── embed.py              # Embedding generation
│   ├── cluster.py            # HDBSCAN clustering
│   ├── describe.py           # LLM labeling
│   └── export.py             # JSON export
├── visualizer/
│   ├── app/
│   │   ├── page.tsx
│   │   └── layout.tsx
│   ├── components/
│   │   ├── HierarchicalView.tsx
│   │   ├── ScatterPlot.tsx
│   │   └── ...
│   ├── lib/
│   │   ├── types.ts
│   │   ├── data.ts
│   │   └── utils.ts
│   └── public/data/
│       ├── items.json
│       └── clusters.json
├── test_fetcher.py           # API access tests
├── requirements.txt
└── README.md
Dependencies
Python (for pipeline)
openai>=1.0
instructor>=1.0
hdbscan>=0.8.33
umap-learn>=0.5
scikit-learn>=1.3
numpy>=1.24
rich>=13.0
requests  # used by test_fetcher.py
Node.js (for visualization)
{
  "dependencies": {
    "next": "14.2.0",
    "react": "^18.2.0",
    "d3": "^7.8.5",
    "framer-motion": "^11.0.0",
    "tailwindcss": "^3.4.1"
  }
}
Environment Variables
OPENAI_API_KEY=sk-...    # Required for embeddings and labeling

# Plus whatever auth your data source needs:
GITHUB_TOKEN=ghp_...     # For GitHub
SLACK_TOKEN=xoxb-...     # For Slack
# etc.
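A pipeline script can check for these variables before doing any work. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical, and the list of required names depends on your data source.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Example: a GitHub-backed pipeline needs both keys.
    missing = missing_env_vars(["OPENAI_API_KEY", "GITHUB_TOKEN"])
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```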
Running the Pipeline
# 1. Test API access
python test_fetcher.py

# 2. Scrape data
python -m pipeline.scraper

# 3. Generate embeddings
python -m pipeline.embed

# 4. Cluster
python -m pipeline.cluster

# 5. Label clusters with LLM
python -m pipeline.describe

# 6. Export for visualization
python -m pipeline.export

# 7. Run visualizer
cd visualizer && npm run dev
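If you prefer a single entry point, steps 1 through 6 can be chained so that a failure stops the run early. This is a minimal sketch under the assumption that each step signals failure with a nonzero exit code; `run_steps` and `STEPS` are hypothetical names, and the visualizer (step 7) is still started separately.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Steps 1-6 from the commands above, run with the current Python interpreter.
STEPS: List[List[str]] = [
    [sys.executable, "test_fetcher.py"],
    [sys.executable, "-m", "pipeline.scraper"],
    [sys.executable, "-m", "pipeline.embed"],
    [sys.executable, "-m", "pipeline.cluster"],
    [sys.executable, "-m", "pipeline.describe"],
    [sys.executable, "-m", "pipeline.export"],
]


def run_steps(steps: Sequence[Sequence[str]], run: Callable = subprocess.run) -> int:
    """Run each command in order and stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    for completed, command in enumerate(steps):
        result = run(list(command))
        if result.returncode != 0:
            print(f"Step failed: {' '.join(command)}", file=sys.stderr)
            return completed
    return len(steps)


if __name__ == "__main__":
    finished = run_steps(STEPS)
    sys.exit(0 if finished == len(STEPS) else 1)
```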
Key Design Decisions
- SQLite for storage - Simple, portable, supports resumability
- HDBSCAN over K-means - Does not require choosing the number of clusters up front, and marks outliers as noise instead of forcing them into a cluster
- 3-level hierarchy - Coarse (L1) -> Medium (L2) -> Fine (L3)
- UMAP for projection - Fast on large datasets and tends to preserve more global structure than t-SNE
- text-embedding-3-large - OpenAI's most capable embedding model (3072 dimensions by default)
- Next.js + D3 - Fast, interactive visualization with SSR support
Detailed Documentation
- Data Sourcing Patterns (data-sourcing.md) - API patterns, auth, pagination
- Clustering Implementation (clustering-reference.md) - Embedding, HDBSCAN, UMAP code
- Visualization Setup (visualization-setup.md) - Next.js app and components