Hierarchical Taxonomy Clustering
Create a unified multi-level taxonomy from hierarchical category paths by clustering similar paths and automatically generating meaningful category names.
Problem
Given category paths from multiple sources (e.g., "electronics -> computers -> laptops"), create a unified taxonomy that groups similar paths across sources, generates meaningful category names, and produces a clean N-level hierarchy (typically 5 levels). The unified taxonomy can then be used for analysis or metric tracking across products from different platforms.
Methodology
- Hierarchical Weighting: Convert paths to embeddings with exponentially decaying level weights (level i gets weight 0.6^(i-1)), so coarser levels contribute more than fine-grained ones
- Recursive Clustering: Hierarchically cluster at each level (10-20 clusters at L1, 3-20 at L2-L5) using cosine distance
- Intelligent Naming: Generate category names via weighted word frequency, lemmatization, and bundle-word logic
- Quality Control: Exclude all ancestor words (parent, grandparent, etc.), avoid duplicating ancestor paths, and clean special characters
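The weighting scheme above can be sketched as follows (a minimal sketch; the function names are hypothetical, and the per-level embeddings are assumed to come from a sentence encoder such as sentence-transformers):

```python
import numpy as np

def level_weights(n_levels: int, decay: float = 0.6) -> np.ndarray:
    # Level i (1-indexed) gets weight decay**(i-1): 1.0, 0.6, 0.36, ...
    return decay ** np.arange(n_levels)

def combine_level_embeddings(level_vecs: np.ndarray, decay: float = 0.6) -> np.ndarray:
    # level_vecs: (n_levels, dim) -- one embedding per category level.
    # Weighted sum, renormalized, so coarse levels dominate but depth still matters.
    w = level_weights(level_vecs.shape[0], decay)
    combined = (w[:, None] * level_vecs).sum(axis=0)
    return combined / np.linalg.norm(combined)
```

Because the combined vector is renormalized, cosine distances between records remain comparable regardless of path depth.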
Output
DataFrame with added columns:
- `unified_level_1`: Top-level category (e.g., "electronic | device")
- `unified_level_2`: Second-level category (e.g., "computer | laptop")
- `unified_level_3` through `unified_level_N`: Deeper levels

Category names use the `|` separator, contain at most 5 words, and cover 70%+ of the records in each cluster.
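The "max 5 words, ≥70% coverage" naming rule can be sketched as a greedy word selection (hypothetical function name; lemmatization, ancestor-word exclusion, and bundle-word logic are omitted here):

```python
from collections import Counter

def name_cluster(paths, max_words=5, coverage=0.7):
    # paths: list of word lists, one per record in the cluster.
    # Greedily add the most frequent words until at least `coverage`
    # of records contain a chosen word, or max_words is reached.
    freq = Counter(w for p in paths for w in set(p))
    chosen = []
    for word, _ in freq.most_common():
        chosen.append(word)
        covered = sum(1 for p in paths if any(w in p for w in chosen))
        if covered / len(paths) >= coverage or len(chosen) == max_words:
            break
    return " | ".join(chosen)
```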
Installation
```shell
pip install pandas numpy scipy sentence-transformers nltk tqdm
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
```
4-Step Pipeline
Step 1: Load, Standardize, Filter and Merge (step1_preprocessing_and_merge.py)
- Input: List of (DataFrame, source_name) tuples, each with a `category_path` column
- Process: Per-source deduplication; text cleaning (remove special characters such as `&`, `,`, `'`, `-`, and quotes, drop filler words like "and", and lemmatize words as nouns); normalize the delimiter to `>`; depth filtering; prefix removal; then merge all sources. The `source_level` columns should reflect the processed version of the source level names
- Output: Merged DataFrame with `category_path`, `source`, `depth`, and `source_level_1` through `source_level_N`
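The cleaning and delimiter normalization can be sketched like this (hypothetical function name; lemmatization, e.g. with NLTK's `WordNetLemmatizer` in noun mode, is omitted from the sketch):

```python
import re

def clean_path(raw: str) -> str:
    # Normalize common delimiters ("->", ">", "/") to ">",
    # strip special characters (&, commas, quotes, hyphens),
    # and drop the filler word "and".
    levels = re.split(r"\s*(?:->|>|/)\s*", raw.strip().lower())
    cleaned = []
    for level in levels:
        level = re.sub(r"[&,'\"-]", " ", level)
        words = [w for w in level.split() if w != "and"]
        if words:
            cleaned.append(" ".join(words))
    return " > ".join(cleaned)
```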
Step 2: Weighted Embeddings (step2_weighted_embedding_generation.py)
- Input: DataFrame from Step 1
- Output: NumPy embedding matrix (n_records × 384)
- Weights: L1=1.0, L2=0.6, L3=0.36, L4=0.216, L5=0.1296 (exponential decay 0.6^(n-1))
- Performance: For ~10,000 records, expect 2-5 minutes; a progress bar shows encoding status
Step 3: Recursive Clustering (step3_recursive_clustering_naming.py)
- Input: DataFrame + embeddings from Step 2
- Output: Assignments dict {index → {level_1: ..., level_5: ...}}
- Clustering: Average linkage with cosine distance; 10-20 clusters at L1, 3-20 at L2-L5
- Word-based naming: weighted frequency + lemmatization + coverage ≥ 70%
- Performance: For ~10,000 records, expect 1-3 minutes for hierarchical clustering and naming across the recursive levels
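Clustering at a single level with average linkage and cosine distance can be sketched with SciPy (hypothetical function name; the recursive part, re-clustering each cluster's rows at the next level, is omitted):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_level(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    # Average-linkage agglomerative clustering on cosine distances,
    # cut to a fixed number of clusters. Labels are 1..n_clusters.
    dists = pdist(embeddings, metric="cosine")
    Z = linkage(dists, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Recursion would then call `cluster_level` again on the embeddings of each resulting cluster, with 3-20 sub-clusters per level.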
Step 4: Export Results (step4_result_assignments.py)
- Input: DataFrame + assignments from Step 3
- Output:
  - `unified_taxonomy_full.csv`: all records with unified categories
  - `unified_taxonomy_hierarchy.csv`: unique taxonomy structure
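The export step can be sketched with pandas (hypothetical function name; assumes the Step 3 assignments dict maps row index to per-level names):

```python
import pandas as pd

def export_results(df: pd.DataFrame, assignments: dict, out_prefix: str = "unified_taxonomy"):
    # assignments: {row index -> {"level_1": name, ..., "level_N": name}}
    unified = pd.DataFrame.from_dict(assignments, orient="index")
    unified.columns = [f"unified_{c}" for c in unified.columns]
    # Full file: every input record plus its unified categories.
    full = df.join(unified)
    full.to_csv(f"{out_prefix}_full.csv", index=False)
    # Hierarchy file: deduplicated unified paths only.
    hierarchy = unified.drop_duplicates().sort_values(list(unified.columns))
    hierarchy.to_csv(f"{out_prefix}_hierarchy.csv", index=False)
    return full, hierarchy
```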
Usage
Use scripts/pipeline.py to run the complete 4-step workflow. It provides:
- Complete implementation of all 4 steps
- Example code for processing multiple sources
- A command-line interface
- Individual step usage (for advanced control)