CSV/Excel Merger
Intelligently merge multiple CSV or Excel files with automatic column matching and data deduplication.
Instructions
When a user needs to merge CSV or Excel files:
- •
Analyze Input Files:
- •How many files need to be merged?
- •What format (CSV, Excel, TSV)?
- •Are the files provided or need to be read from disk?
- •Do columns have the same names across files?
- •What is the primary key (unique identifier)?
- •
Inspect File Structures:
- •Read headers from each file
- •Identify column names and data types
- •Detect encoding (UTF-8, Latin-1, etc.)
- •Check for missing columns
- •Look for duplicate column names
- •
Create Merge Strategy:
Column Matching:
- •Exact name match: "email" = "email"
- •Case-insensitive: "Email" = "email"
- •Fuzzy match: "E-mail" ≈ "email"
- •Common patterns:
- •"first_name", "firstname", "First Name" → "first_name"
- •"phone", "phone_number", "tel" → "phone"
- •"email", "e-mail", "email_address" → "email"
Conflict Resolution (when same record appears in multiple files):
- •Keep first: Use value from first file
- •Keep last: Use value from last file (most recent)
- •Keep longest: Use most complete value
- •Manual review: Flag conflicts for user review
- •Merge: Combine non-conflicting fields
Deduplication:
- •Identify duplicate rows based on primary key
- •Options: keep first, keep last, keep all, merge values
- •Track source file for each row
- •
Perform Merge:
python# Example merge logic import pandas as pd # Read files df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv') # Normalize column names df1.columns = df1.columns.str.lower().str.strip() df2.columns = df2.columns.str.lower().str.strip() # Map similar columns column_mapping = { 'firstname': 'first_name', 'e_mail': 'email', # ... } df2 = df2.rename(columns=column_mapping) # Merge merged = pd.concat([df1, df2], ignore_index=True) # Deduplicate merged = merged.drop_duplicates(subset=['email'], keep='last') # Save merged.to_csv('merged_output.csv', index=False) - •
Format Output:
code📊 CSV/EXCEL MERGER REPORT ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 📁 INPUT FILES ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ File 1: contacts_jan.csv Rows: 1,245 Columns: 8 (name, email, phone, company, ...) File 2: contacts_feb.csv Rows: 987 Columns: 9 (firstname, lastname, email, mobile, ...) File 3: leads_export.xlsx Rows: 2,103 Columns: 12 (full_name, email_address, phone, ...) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔄 COLUMN MAPPING ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Unified Schema: • first_name ← [firstname, first name, fname] • last_name ← [lastname, last name, lname] • email ← [email, e-mail, email_address] • phone ← [phone, mobile, phone_number, tel] • company ← [company, organization, org] • title ← [title, job_title, position] • source ← [file origin tracking] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔍 MERGE ANALYSIS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Total rows before merge: 4,335 Duplicate records found: 892 Conflicts detected: 47 Deduplication Strategy: Keep most recent (by source file date) Primary Key: email ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⚠️ CONFLICTS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Record: john.doe@example.com File 1 phone: (555) 123-4567 File 2 phone: (555) 987-6543 Resolution: Kept most recent (File 2) [List top 10 conflicts] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✅ MERGE RESULTS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Output File: merged_contacts.csv Total Rows: 3,443 Columns: 7 Duplicates Removed: 892 Breakdown by Source: • contacts_jan.csv: 1,245 rows (398 unique) • contacts_feb.csv: 987 rows (521 unique) • leads_export.xlsx: 2,103 rows (2,524 unique) Data Quality: • Email completeness: 98.2% • Phone completeness: 87.5% • Company completeness: 91.3% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 💡 RECOMMENDATIONS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • Review 47 conflict records manually • Standardize phone number format • Fill missing company names (8.7% incomplete) • Export conflicts to: conflicts_review.csv
- •
Handle Special Cases:
Multiple Primary Keys:
- •Use compound keys: (email + company)
- •Offer options when ambiguous
Different Data Types:
- •Convert dates to standard format
- •Normalize phone numbers
- •Standardize country codes
- •Clean whitespace and casing
Missing Columns:
- •Fill with empty values
- •Flag missing data
- •Offer to create new columns
Large Files:
- •Use chunking for files > 100MB
- •Show progress indicator
- •Estimate memory usage
- •
Generate Code: Provide Python/pandas script that:
- •Reads all files
- •Performs intelligent column matching
- •Deduplicates based on strategy
- •Resolves conflicts
- •Saves merged output
- •Generates detailed report
- •
Export Options:
- •CSV (UTF-8)
- •Excel (.xlsx)
- •JSON
- •SQL INSERT statements
- •Parquet (for large datasets)
Example Triggers
- •"Merge these three CSV files"
- •"Combine multiple Excel sheets into one file"
- •"Deduplicate and merge customer data"
- •"Join spreadsheets with different column names"
- •"Consolidate contact lists from different sources"
Best Practices
Column Matching:
- •Use fuzzy matching for similar names
- •Maintain original column name mapping report
- •Allow manual override of auto-matching
Data Quality:
- •Trim whitespace
- •Standardize formats (phone, email, dates)
- •Detect and flag invalid data
- •Preserve data types
Performance:
- •Use chunking for large files
- •Process in batches
- •Show progress for long operations
- •Optimize memory usage
Transparency:
- •Log all merge decisions
- •Track source file for each row
- •Report conflicts and resolutions
- •Generate detailed merge report
Output Quality
Ensure merges:
- •Intelligently match columns
- •Handle different schemas
- •Deduplicate properly
- •Preserve data integrity
- •Flag conflicts for review
- •Generate comprehensive report
- •Maintain data quality
- •Track data lineage (source)
- •Handle edge cases gracefully
- •Provide validation statistics
Generate clean, deduplicated merged files with full transparency and data quality checks.