Data Quality Standards for Somali Dialect Classifier

Quality Dimensions

1. Completeness

•All required fields present (text, label, source, timestamp)
•No null or empty text fields
•Labels properly assigned (Northern/Southern/Central)

2. Accuracy

•Text is in Somali (not English, Arabic, or other languages)
•Labels match actual dialect (validated by native speakers)
•Geographic metadata aligns with dialect labels

3. Consistency

•Uniform text encoding (UTF-8)
•Consistent label format (standardized names)
•Timestamp format standardized (ISO 8601)
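
The consistency rules above could be enforced with a small normalizer. The sketch below is illustrative only: the alias table `LABEL_ALIASES` and both function names are hypothetical, and timestamps are checked with the standard library's `datetime.fromisoformat`, which accepts a common subset of ISO 8601 rather than the full standard.

```python
from datetime import datetime

# Hypothetical alias table: maps lowercase variants to the canonical labels.
LABEL_ALIASES = {
    "northern": "Northern", "north": "Northern",
    "southern": "Southern", "south": "Southern",
    "central": "Central",
}

def normalize_label(raw):
    """Return the canonical label, or None if the value is not recognized."""
    if not isinstance(raw, str):
        return None
    return LABEL_ALIASES.get(raw.strip().lower())

def is_iso8601(timestamp):
    """True if the string parses with datetime.fromisoformat."""
    if not isinstance(timestamp, str):
        return False
    try:
        datetime.fromisoformat(timestamp)
    except ValueError:
        return False
    return True
```

Returning `None` for an unknown label, instead of guessing, lets the caller route the record to manual review.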

4. Uniqueness

•No exact duplicates
•Near-duplicate detection (>95% similarity flagged)
•Source URL deduplication

5. Validity

•Text length within acceptable range (10-5000 characters)
•No corrupted/garbled text
•No HTML tags or formatting artifacts

Quality Metrics

Critical Metrics

Language Purity:

•Target: >98% Somali text
•Method: Language detection (langdetect, fastText)
•Action: Remove non-Somali text
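
As a rough sketch, the purity metric and the removal action can be computed from any per-text detector. The function names and the `toy_detector` below are assumptions made so the example runs on its own; a real pipeline would pass a langdetect- or fastText-based function instead.

```python
def language_purity(texts, is_somali):
    """Fraction of texts for which the detector returns True."""
    if not texts:
        return 0.0
    hits = sum(1 for t in texts if is_somali(t))
    return hits / len(texts)

def filter_somali(texts, is_somali):
    """Keep only the texts the detector accepts (the 'remove non-Somali' action)."""
    return [t for t in texts if is_somali(t)]

# Toy detector used only to make the example runnable; it is NOT a
# meaningful language check.
def toy_detector(text):
    return "waa" in text.lower().split()

sample = ["Tani waa tijaabo", "This is English", "Magaaladu waa weyn tahay"]
purity = language_purity(sample, toy_detector)
meets_target = purity > 0.98  # the >98% target stated above
```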

Duplicate Rate:

•Target: <2% duplicates
•Method: Exact match + fuzzy matching (e.g., Levenshtein distance or a difflib similarity ratio)
•Action: Keep first occurrence, remove duplicates
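
One possible way to implement the exact-match half of this metric is to hash each text and keep the first record per hash. This is a minimal sketch under stated assumptions: `normalize_for_hash` and `drop_exact_duplicates` are hypothetical names, and treating case and whitespace differences as "the same text" is a choice made here, not something the standard specifies.

```python
import hashlib

def normalize_for_hash(text):
    # Assumption: case and whitespace differences do not make texts distinct.
    return " ".join(text.lower().split())

def drop_exact_duplicates(records):
    """Keep the first occurrence of each text; return (kept, duplicate_rate)."""
    seen = set()
    kept = []
    for record in records:
        normalized = normalize_for_hash(record["text"])
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # a later copy of a text already kept
        seen.add(digest)
        kept.append(record)
    removed = len(records) - len(kept)
    rate = removed / len(records) if records else 0.0
    return kept, rate
```

The returned rate is the share of input records removed, which can be compared directly against the duplicate-rate target.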

Label Confidence:

•Target: >90% inter-annotator agreement
•Method: Multiple annotators for sample
•Action: Re-label low-confidence examples
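
Inter-annotator agreement for two annotators can be sketched as below. Raw percent agreement is the direct reading of the target above; Cohen's kappa is added here as an assumption, since it corrects for agreement expected by chance and the standard does not say which statistic is intended. Function names are hypothetical.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    observed = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Probability that both annotators pick the same label by chance.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With skewed label distributions, percent agreement can look high while kappa stays low, so reporting both is a reasonable precaution.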

Text Quality Score:

•Target: Average score >7/10
•Components: Length, vocabulary richness, absence of markup artifacts, sentence punctuation (the components scored in Stage 4)
•Action: Filter texts with score <5

Validation Pipeline

Stage 1: Basic Validation

python

def basic_validation(record):
    # Treat a missing or None text field as empty so the checks never raise.
    text = record.get('text') or ''
    checks = {
        'has_text': bool(text.strip()),
        'has_label': record.get('label') in ('Northern', 'Southern', 'Central'),
        'valid_length': 10 <= len(text) <= 5000,
        # is_valid_utf8 is a separate helper; it must be in scope when this runs.
        'valid_encoding': is_valid_utf8(text),
    }
    return all(checks.values()), checks
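
Stage 1 calls `is_valid_utf8`, whose behavior is not specified in this document. The following is one plausible sketch, with the behavior assumed rather than taken from the project: raw bytes must decode as UTF-8, and decoded text must contain neither lone surrogates nor the U+FFFD replacement character that usually marks an earlier decoding failure.

```python
def is_valid_utf8(text):
    """Assumed behavior: True if the value is clean UTF-8 text."""
    if isinstance(text, bytes):
        try:
            text = text.decode("utf-8")
        except UnicodeDecodeError:
            return False
    if not isinstance(text, str):
        return False
    try:
        text.encode("utf-8")  # fails on lone surrogates
    except UnicodeEncodeError:
        return False
    # U+FFFD typically means bytes were already lost in an earlier decode.
    return "\ufffd" not in text
```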

Stage 2: Language Detection

python

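from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; a fixed seed makes results repeatable.
DetectorFactory.seed = 0

def validate_language(text):
    try:
        return detect(text) == 'so'  # 'so' is the ISO 639-1 code for Somali
    except LangDetectException:
        # Raised when no language can be detected (e.g. empty or digits-only text).
        return False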

Stage 3: Duplicate Detection

python

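from difflib import SequenceMatcher

def is_near_duplicate(text1, text2, threshold=0.95):
    # SequenceMatcher.ratio() is a matching-blocks similarity in [0, 1], not a
    # Levenshtein distance. autojunk=False disables a heuristic that can skew
    # the ratio for sequences of 200 or more characters.
    similarity = SequenceMatcher(None, text1, text2, autojunk=False).ratio()
    return similarity >= threshold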
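
To apply the pairwise check across a whole batch, something like the following could work. It is a sketch, not the project's method: `find_near_duplicate_pairs` is a hypothetical name, and the scan compares every pair, so its cost grows quadratically and it is only practical for small batches.

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_near_duplicate_pairs(texts, threshold=0.95):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        # quick_ratio() is an upper bound on ratio(), so it rules pairs out cheaply.
        if matcher.quick_ratio() < threshold:
            continue
        similarity = matcher.ratio()
        if similarity >= threshold:
            flagged.append((i, j, similarity))
    return flagged
```

Flagged pairs are returned rather than removed, matching the guardrail that near-duplicates go to review.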

Stage 4: Quality Scoring

python

def compute_quality_score(text):
    score = 0
    # Length appropriateness (1-3 points)
    if 50 <= len(text) <= 1000:
        score += 3
    elif 20 <= len(text) < 50 or 1000 < len(text) <= 3000:
        score += 2
    else:
        score += 1

    # Vocabulary richness (1-3 points)
    unique_words = len(set(text.split()))
    total_words = len(text.split())
    if total_words > 0:
        vocab_ratio = unique_words / total_words
        if vocab_ratio > 0.7:
            score += 3
        elif vocab_ratio > 0.5:
            score += 2
        else:
            score += 1

    # No HTML/formatting artifacts (0 or 2 points)
    if not ('<' in text or '>' in text or '{' in text):
        score += 2

    # Proper sentences (0 or 2 points)
    if text.count('.') >= 1:  # At least one sentence-ending period
        score += 2

    return min(score, 10)  # Cap at 10

Acceptance Criteria

Minimum Quality Thresholds

For Training Set:

•Language purity: >98% Somali
•Duplicate rate: <1%
•Quality score: Average >7.5
•Label confidence: >95%

For Validation/Test Sets:

•Language purity: >99% Somali
•Duplicate rate: 0% (strict)
•Quality score: Average >8.0
•Label confidence: >98% (manually validated)
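
The thresholds above can be expressed as data and checked mechanically. The sketch below assumes a hypothetical `metrics` dict with keys `purity`, `duplicate_rate`, `avg_quality`, and `label_confidence` (all rates as fractions); the key names and the function are illustrative, not part of the project.

```python
# Thresholds copied from the lists above; rates are expressed as fractions.
THRESHOLDS = {
    "training": {"min_purity": 0.98, "max_duplicate_rate": 0.01,
                 "min_avg_quality": 7.5, "min_label_confidence": 0.95},
    "evaluation": {"min_purity": 0.99, "max_duplicate_rate": 0.0,
                   "min_avg_quality": 8.0, "min_label_confidence": 0.98},
}

def check_acceptance(metrics, split):
    """Return (passed, failures) for a dict of dataset-level metrics."""
    limits = THRESHOLDS[split]
    failures = []
    if not metrics["purity"] > limits["min_purity"]:
        failures.append("purity")
    # Duplicate rate must be strictly below the limit, except the 0%
    # evaluation limit, which is read here as "exactly zero".
    max_dup = limits["max_duplicate_rate"]
    if max_dup == 0:
        dup_ok = metrics["duplicate_rate"] == 0
    else:
        dup_ok = metrics["duplicate_rate"] < max_dup
    if not dup_ok:
        failures.append("duplicate_rate")
    if not metrics["avg_quality"] > limits["min_avg_quality"]:
        failures.append("avg_quality")
    if not metrics["label_confidence"] > limits["min_label_confidence"]:
        failures.append("label_confidence")
    return not failures, failures
```

Returning the list of failed metrics, not just a boolean, makes it easier to see which threshold blocked a split.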

Quality Guardrails

Automatic Filters

Remove if:

•Non-Somali language detected
•Exact duplicate found
•Text length <10 or >5000 characters
•Quality score <5
•Contains >20% numbers/special characters

Flag for review if:

•Near-duplicate (>95% similarity)
•Quality score ≥5 and <7
•Label confidence <90%
•Unusual character patterns

Accept if:

•All validation checks pass
•Quality score ≥7
•No duplicates
•Language = Somali
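
The three outcomes above can be sketched as a single triage function. Several points are assumptions: the function and parameter names are hypothetical, "numbers/special characters" is read as any character that is neither a letter nor whitespace (so ordinary punctuation counts), a quality score of exactly 7 is treated as acceptable, and the "unusual character patterns" rule is left out because the standard does not define it.

```python
def symbol_ratio(text):
    """Share of characters that are neither letters nor whitespace."""
    if not text:
        return 0.0
    symbols = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return symbols / len(text)

def triage(text, *, is_somali, is_exact_duplicate, is_near_duplicate,
           quality_score, label_confidence):
    """Return 'remove', 'review', or 'accept' for one record."""
    # Removal rules are checked first so they always win over review rules.
    if (not is_somali
            or is_exact_duplicate
            or not 10 <= len(text) <= 5000
            or quality_score < 5
            or symbol_ratio(text) > 0.20):
        return "remove"
    if is_near_duplicate or quality_score < 7 or label_confidence < 0.90:
        return "review"
    return "accept"
```

The inputs are passed in as precomputed values so the function stays independent of whichever detector, deduplicator, and scorer the pipeline uses.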

Quality Reporting

Metrics to Track

Dataset-Level:

•Total records
•Records passing validation (%)
•Average quality score
•Duplicate count
•Language distribution (% Somali)

Per-Source:

•Source name
•Records contributed
•Average quality score
•Duplicate rate
•Rejection rate

Per-Dialect:

•Dialect label
•Record count
•Average quality score
•Inter-annotator agreement
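
The per-source and per-dialect figures could be produced by one grouping helper, sketched below. The record fields `quality` and `accepted` and the function name `summarize_by` are hypothetical; only `source` and `label` come from the required fields listed earlier.

```python
from collections import defaultdict

def summarize_by(records, key):
    """Group records by `key` and report count, mean quality, rejection rate."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    summary = {}
    for name, items in groups.items():
        rejected = sum(1 for r in items if not r["accepted"])
        summary[name] = {
            "records": len(items),
            "avg_quality": sum(r["quality"] for r in items) / len(items),
            "rejection_rate": rejected / len(items),
        }
    return summary
```

Calling `summarize_by(records, "source")` gives the per-source view and `summarize_by(records, "label")` the per-dialect view.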

Example Report:

code

Dataset Quality Report - 2025-11-06

Total Records: 10,000
Passing Validation: 9,200 (92%)
Average Quality Score: 7.8/10
Duplicates Removed: 600 (6%)
Language Purity: 98.5% Somali

Per-Source Quality:
- Wikipedia: 8.5/10 (3,000 records)
- BBC Somali: 8.2/10 (2,500 records)
- Social Media: 6.9/10 (4,500 records, 30% rejected)

Per-Dialect Distribution (of the 9,200 passing records):
- Northern: 5,500 (59.8%)
- Southern: 2,200 (23.9%)
- Central: 1,500 (16.3%)

When This Skill Activates

This skill auto-invokes when you mention:

•Data quality, data validation, quality checks
•Duplicates, deduplication, duplicate detection
•Quality metrics, quality score, quality standards
•Data cleaning, data filtering, guardrails
•Language detection, language purity
•Acceptance criteria, validation rules

Version: 1.0.0
Last Updated: 2025-11-06
Project: Somali Dialect Classifier