Document Clustering Skill
Analyze a repository of documents to group them based on content similarity, topic, or purpose. This skill helps organize large collections, identify redundancies, and discover relationships.
Inputs
- •
PATH- The repository to analyze (e.g., "/repository") - •
SIMILARITY_THRESHOLD- (Optional) Float (0.0-1.0), threshold for grouping (default: 0.8) - •
VISUALIZATION- (Optional) Boolean, whether to generate a visual representation (default: false)
Workflow
Step 1: Text Processing
Ingest documents from PATH.
- •Normalize text (remove stop words, stemming/lemmatization).
- •Generate embeddings or TF-IDF vectors for each document.
Step 2: Clustering Analysis
Apply clustering algorithms (e.g., K-Means, DBSCAN) to the document vectors.
- •Group documents that meet the
SIMILARITY_THRESHOLD. - •Identify outliers or unique documents.
Step 3: Cluster Labeling
Analyze the centroid or representative terms of each cluster to assign a meaningful label (Topic).
Step 4: Output Generation
Generate the clustering report.
- •If
VISUALIZATIONis true, create a scatter plot or dendrogram data.
Required Outputs
A CLUSTERING_REPORT object containing:
- •Cluster List: ID, Label, and List of Documents in each cluster.
- •Redundancy Report: Sets of highly similar documents (potential duplicates).
- •Visualization Data: (If requested) Coordinates for plotting.
Quick Reference
- •Purpose: Organize unstructured content and find duplicates.
- •Techniques: Text Mining, NLP, Vector Space Models.