Hierarchical Taxonomy Clustering
Create a unified multi-level taxonomy from hierarchical category paths by clustering similar paths and automatically generating meaningful category names.
Problem
Given category paths from multiple sources (e.g., "electronics -> computers -> laptops"), create a unified taxonomy that groups similar paths across sources, generates meaningful category names, and produces a clean N-level hierarchy (typically 5 levels). The unified taxonomy can then be used for analysis or metric tracking across products from different platforms.
Methodology
- Hierarchical Weighting: Convert paths to embeddings with exponentially decaying level weights (level i gets weight 0.6^(i-1)), so coarser levels contribute more than fine-grained ones
- Recursive Clustering: Hierarchically cluster at each level (10-20 clusters at L1, 3-20 at L2-L5) using cosine distance
- Intelligent Naming: Generate category names via weighted word frequency, lemmatization, and bundle-word logic
- Quality Control: Exclude all ancestor words (parent, grandparent, etc.), avoid duplicating ancestor paths, and clean special characters
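The weighting scheme above can be sketched as follows (a minimal sketch; the function names are hypothetical, and the per-level embeddings are assumed to come from a sentence encoder such as sentence-transformers):

```python
import numpy as np

def level_weights(n_levels: int, decay: float = 0.6) -> np.ndarray:
    # Level i (1-indexed) gets weight decay**(i-1): 1.0, 0.6, 0.36, ...
    return decay ** np.arange(n_levels)

def combine_level_embeddings(level_vecs: np.ndarray, decay: float = 0.6) -> np.ndarray:
    # level_vecs: (n_levels, dim) -- one embedding per category level.
    # Weighted sum, renormalized, so coarse levels dominate but depth still matters.
    w = level_weights(level_vecs.shape[0], decay)
    combined = (w[:, None] * level_vecs).sum(axis=0)
    return combined / np.linalg.norm(combined)
```

Because the combined vector is renormalized, cosine distances between records remain comparable regardless of path depth.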
Output
DataFrame with added columns:
- `unified_level_1`: Top-level category (e.g., "electronic | device")
- `unified_level_2`: Second-level category (e.g., "computer | laptop")
- `unified_level_3` through `unified_level_N`: Deeper levels

Category names use the `|` separator, contain at most 5 words, and cover 70%+ of the records in each cluster.
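The "max 5 words, ≥70% coverage" naming rule can be sketched as a greedy word selection (hypothetical function name; lemmatization, ancestor-word exclusion, and bundle-word logic are omitted here):

```python
from collections import Counter

def name_cluster(paths, max_words=5, coverage=0.7):
    # paths: list of word lists, one per record in the cluster.
    # Greedily add the most frequent words until at least `coverage`
    # of records contain a chosen word, or max_words is reached.
    freq = Counter(w for p in paths for w in set(p))
    chosen = []
    for word, _ in freq.most_common():
        chosen.append(word)
        covered = sum(1 for p in paths if any(w in p for w in chosen))
        if covered / len(paths) >= coverage or len(chosen) == max_words:
            break
    return " | ".join(chosen)
```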
Installation
```shell
pip install pandas numpy scipy sentence-transformers nltk tqdm
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
```
4-Step Pipeline
Step 1: Load, Standardize, Filter and Merge (step1_preprocessing_and_merge.py)
- Input: List of (DataFrame, source_name) tuples, each with a `category_path` column
- Process: Per-source deduplication; text cleaning (remove special characters such as `&`, `,`, `'`, `-`, and quotes, drop filler words like "and", and lemmatize words as nouns); normalize the delimiter to `>`; depth filtering; prefix removal; then merge all sources. The `source_level` columns should reflect the processed version of the source level names
- Output: Merged DataFrame with `category_path`, `source`, `depth`, and `source_level_1` through `source_level_N`
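The cleaning and delimiter normalization can be sketched like this (hypothetical function name; lemmatization, e.g. with NLTK's `WordNetLemmatizer` in noun mode, is omitted from the sketch):

```python
import re

def clean_path(raw: str) -> str:
    # Normalize common delimiters ("->", ">", "/") to ">",
    # strip special characters (&, commas, quotes, hyphens),
    # and drop the filler word "and".
    levels = re.split(r"\s*(?:->|>|/)\s*", raw.strip().lower())
    cleaned = []
    for level in levels:
        level = re.sub(r"[&,'\"-]", " ", level)
        words = [w for w in level.split() if w != "and"]
        if words:
            cleaned.append(" ".join(words))
    return " > ".join(cleaned)
```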
Step 2: Weighted Embeddings (step2_weighted_embedding_generation.py)
- Input: DataFrame from Step 1
- Output: NumPy embedding matrix (n_records × 384)
- Weights: L1=1.0, L2=0.6, L3=0.36, L4=0.216, L5=0.1296 (exponential decay 0.6^(n-1))
- Performance: For ~10,000 records, expect 2-5 minutes; a progress bar shows encoding status
Step 3: Recursive Clustering (step3_recursive_clustering_naming.py)
- Input: DataFrame + embeddings from Step 2
- Output: Assignments dict {index → {level_1: ..., level_5: ...}}
- Clustering: Average linkage with cosine distance; 10-20 clusters at L1, 3-20 at L2-L5
- Word-based naming: weighted frequency + lemmatization + coverage ≥ 70%
- Performance: For ~10,000 records, expect 1-3 minutes for hierarchical clustering and naming across the recursive levels
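Clustering at a single level with average linkage and cosine distance can be sketched with SciPy (hypothetical function name; the recursive part, re-clustering each cluster's rows at the next level, is omitted):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_level(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    # Average-linkage agglomerative clustering on cosine distances,
    # cut to a fixed number of clusters. Labels are 1..n_clusters.
    dists = pdist(embeddings, metric="cosine")
    Z = linkage(dists, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Recursion would then call `cluster_level` again on the embeddings of each resulting cluster, with 3-20 sub-clusters per level.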
Step 4: Export Results (step4_result_assignments.py)
- Input: DataFrame + assignments from Step 3
- Output:
  - `unified_taxonomy_full.csv`: all records with unified categories
  - `unified_taxonomy_hierarchy.csv`: unique taxonomy structure
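The export step can be sketched with pandas (hypothetical function name; assumes the Step 3 assignments dict maps row index to per-level names):

```python
import pandas as pd

def export_results(df: pd.DataFrame, assignments: dict, out_prefix: str = "unified_taxonomy"):
    # assignments: {row index -> {"level_1": name, ..., "level_N": name}}
    unified = pd.DataFrame.from_dict(assignments, orient="index")
    unified.columns = [f"unified_{c}" for c in unified.columns]
    # Full file: every input record plus its unified categories.
    full = df.join(unified)
    full.to_csv(f"{out_prefix}_full.csv", index=False)
    # Hierarchy file: deduplicated unified paths only.
    hierarchy = unified.drop_duplicates().sort_values(list(unified.columns))
    hierarchy.to_csv(f"{out_prefix}_hierarchy.csv", index=False)
    return full, hierarchy
```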
Usage
Use scripts/pipeline.py to run the complete 4-step workflow. It provides:
- Complete implementation of all 4 steps
- Example code for processing multiple sources
- A command-line interface
- Individual step usage (for advanced control)