Skill: entity-quality-gate
Goal
Filter out non-company entities (article titles, process names, marketplace listings) before they enter the CRM.
Problem Solved
Precision search and web scraping often extract:
- Article/news headlines ("Sustainable Heat-setting Process")
- Marketplace listings ("Monforts Stenter Machine Product on Alibaba")
- Generic terms ("Manufacturer", "Textile Finishing")
- Academic papers ("Research on Dyeing Methods")
These pollute the lead database and inflate metrics.
Inputs
- Raw leads from any collector
- `config/entity_blacklist.yaml` - disallowed patterns/domains
Outputs
- Filtered leads with an `entity_quality` field (A/B/C/REJECT)
- Rejected entities logged to `logs/entity_rejected.log`
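The two outputs can be produced in a single pass over the graded leads. The following is a minimal sketch, not the actual implementation: the function name `split_and_log`, the dict keys `name` and `reject_reason`, and the JSON-lines log format are all assumptions made for illustration. Only the `entity_quality` field and the log path come from this document.

```python
import json
from pathlib import Path


def split_and_log(graded_leads, log_path="logs/entity_rejected.log"):
    """Return the leads that pass the gate; append rejected ones to the log.

    Assumes each lead is a dict that already carries `entity_quality`.
    The `name` and `reject_reason` keys are hypothetical.
    """
    kept, rejected = [], []
    for lead in graded_leads:
        if lead["entity_quality"] == "REJECT":
            rejected.append(lead)
        else:
            kept.append(lead)

    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # One JSON object per line (assumed format) so reasons stay auditable.
    with path.open("a", encoding="utf-8") as fh:
        for lead in rejected:
            entry = {
                "name": lead.get("name", ""),
                "reason": lead.get("reject_reason", "unspecified"),
            }
            fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return kept
```

Appending rather than overwriting keeps earlier runs in the log; whether the real gate rotates or truncates the file is not stated here.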
Quality Levels
| Grade | Criteria | Action |
|---|---|---|
| A | Company suffix (GmbH/Ltd/SA/Inc) + Website found | CRM ready |
| B | 2+ words + Domain match OR directory source | Needs verification |
| C | No suffix, no website, but from fair/GOTS | Low confidence |
| REJECT | Single word, generic term, marketplace, academic | Drop |
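One way to read the table is as a first-match cascade from A down to REJECT. The sketch below follows that reading under stated assumptions: the `Lead` dataclass and its field names are hypothetical, the B row is interpreted as "2+ words AND (domain match OR directory source)", and the C row is treated as a fallback for anything from a fair or GOTS that did not reach A or B.

```python
import re
from dataclasses import dataclass

# Legal-form suffixes listed in the table; matched at the end of the name.
SUFFIX_RE = re.compile(r"\b(GmbH|Ltd|SA|Inc)\.?$", re.IGNORECASE)


@dataclass
class Lead:
    # All field names are illustrative, not taken from the real code base.
    name: str
    website: str = ""               # official website, if one was found
    domain_match: bool = False      # website domain matches the entity name
    from_directory: bool = False    # came from a directory source
    from_fair_or_gots: bool = False


def grade(lead: Lead) -> str:
    """Return A, B, C or REJECT using the first table row that matches."""
    name = lead.name.strip()
    has_suffix = bool(SUFFIX_RE.search(name))
    multi_word = len(name.split()) >= 2

    if has_suffix and lead.website:
        return "A"
    if multi_word and (lead.domain_match or lead.from_directory):
        return "B"
    if lead.from_fair_or_gots:
        return "C"
    return "REJECT"
```

For example, `grade(Lead("Example Textil GmbH", website="https://example.com"))` returns `"A"`, while `grade(Lead("Manufacturer"))` falls through to `"REJECT"`.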
Implementation
Rejection Rules (immediate drop)
- Single word + generic term: "Manufacturer", "Textile", "Finishing", "Stand"
- Marketplace domains: alibaba.com, indiamart.com, made-in-china.com, globalsources.com
- Academic domains: sciencedirect.com, researchgate.net, academia.edu, springer.com
- News/blog patterns: "... announces", "... reveals", "How to...", "What is..."
- Process/technology phrases: contains "process", "method", "technology" without company name
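The rejection rules above can be sketched as a single function that returns the first matching reason, or `None` when the entity passes. This is an illustration under assumptions, not the contents of `entity_quality_gate.py`: the function and reason names are hypothetical, the lists are hard-coded here although they would presumably be loaded from `config/entity_blacklist.yaml` (whose schema is not shown), a name made up entirely of generic terms is rejected so that "Textile Finishing" is caught too, and "without company name" is approximated as "no legal-form suffix".

```python
import re
from typing import Optional
from urllib.parse import urlparse

GENERIC_TERMS = {"manufacturer", "textile", "finishing", "stand"}
MARKETPLACE_DOMAINS = {"alibaba.com", "indiamart.com",
                       "made-in-china.com", "globalsources.com"}
ACADEMIC_DOMAINS = {"sciencedirect.com", "researchgate.net",
                    "academia.edu", "springer.com"}
HEADLINE_RE = re.compile(r"\b(announces|reveals)\b|^(how to|what is)\b",
                         re.IGNORECASE)
# No trailing \b, so plurals such as "methods" and "processes" also match.
PROCESS_RE = re.compile(r"\b(process|method|technology)", re.IGNORECASE)
# Assumed proxy for "has a company name": a legal-form suffix at the end.
SUFFIX_RE = re.compile(r"\b(GmbH|Ltd|SA|Inc)\.?$", re.IGNORECASE)


def _host_matches(url: str, domains: set) -> bool:
    """True if the URL's host is one of `domains` or a subdomain of one."""
    if not url:
        return False
    if "://" not in url:
        url = "//" + url  # lets urlparse find the host in scheme-less URLs
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def rejection_reason(name: str, source_url: str = "") -> Optional[str]:
    """Return a short reason code if the entity should be dropped."""
    name = name.strip()
    words = [w.lower() for w in name.split()]
    if not words:
        return "empty_name"
    if all(w in GENERIC_TERMS for w in words):
        return "generic_term"
    if _host_matches(source_url, MARKETPLACE_DOMAINS):
        return "marketplace_domain"
    if _host_matches(source_url, ACADEMIC_DOMAINS):
        return "academic_domain"
    if HEADLINE_RE.search(name):
        return "news_or_blog_pattern"
    if PROCESS_RE.search(name) and not SUFFIX_RE.search(name):
        return "process_phrase"
    return None
```

Returning a reason code rather than a boolean means the same value can be written to the rejection log for audit.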
Upgrade Rules (boost quality)
- Company suffix present → Grade A candidate
- Official website found → +1 grade
- From official directory (GOTS/OEKO-TEX/association) → +1 grade
- Has contact info (email/phone) → +1 grade
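The upgrade rules do not say how the "+1 grade" steps combine with the "Grade A candidate" flag, so the sketch below makes two explicit assumptions: grades are ordered C < B < A with each positive signal moving one step up, and grade A is only reachable when a company suffix is present (otherwise the result is capped at B). The function name and parameters are hypothetical.

```python
GRADES = ["C", "B", "A"]  # ascending order of confidence


def apply_upgrades(base_grade: str, *, has_suffix: bool = False,
                   has_website: bool = False, from_directory: bool = False,
                   has_contact: bool = False) -> str:
    """Raise `base_grade` by one step per positive signal.

    Assumption: without a company suffix the lead is not an A candidate,
    so upgrades stop at B. REJECT (or any unknown grade) is never upgraded.
    """
    if base_grade not in GRADES:
        return base_grade

    steps = int(has_website) + int(from_directory) + int(has_contact)
    ceiling = GRADES.index("A") if has_suffix else GRADES.index("B")
    start = GRADES.index(base_grade)
    # max() ensures an existing grade is never lowered by the ceiling.
    return GRADES[max(start, min(start + steps, ceiling))]
```

Under these assumptions a C-grade lead with a website and contact details but no suffix ends at B, and the same lead with a suffix ends at A.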
Verification Checklist
- No rejected entities in final CRM output
- Entity quality distribution logged
- Rejection reasons stored for audit
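The checklist can be turned into an automated check at the end of a run. The following is a sketch under assumptions: the function name `verify_gate`, the logger name, and the `reason` key on log entries are illustrative, and leads are assumed to be dicts carrying the `entity_quality` field.

```python
import logging
from collections import Counter

logger = logging.getLogger("entity_quality_gate")  # hypothetical logger name


def verify_gate(crm_leads, rejected_entries):
    """Check the three checklist items; raise ValueError if any fails.

    crm_leads:        leads about to be written to the CRM
    rejected_entries: entries destined for the rejection log
    """
    distribution = Counter(
        lead.get("entity_quality", "MISSING") for lead in crm_leads
    )
    # Checklist item 2: log the quality distribution.
    logger.info("entity_quality distribution: %s", dict(distribution))

    problems = []
    # Checklist item 1: no rejected entities in the CRM output.
    if distribution.get("REJECT", 0) > 0:
        problems.append(f"{distribution['REJECT']} REJECT lead(s) in CRM output")
    # Checklist item 3: every rejection carries a reason for audit.
    missing = sum(1 for entry in rejected_entries if not entry.get("reason"))
    if missing:
        problems.append(f"{missing} rejection(s) without a reason")

    if problems:
        raise ValueError("; ".join(problems))
    return distribution
```

Raising on failure makes the check usable as a final pipeline step or inside a test.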
Dependencies
- `src/processors/entity_quality_gate.py` - Main implementation
- `config/entity_blacklist.yaml` - Patterns and domains to reject