Exploratory Data Analysis
Overview
Perform comprehensive exploratory data analysis (EDA) on scientific data files across multiple domains. This skill provides automated file type detection, format-specific analysis, and data quality assessment, and it generates detailed markdown reports suitable for documentation and downstream analysis planning.
Key Capabilities:
- •Automatic detection and analysis of 200+ scientific file formats
- •Comprehensive format-specific metadata extraction
- •Data quality and integrity assessment
- •Statistical summaries and distributions
- •Visualization recommendations
- •Downstream analysis suggestions
- •Markdown report generation
When to Use This Skill
Use this skill when:
- •User provides a path to a scientific data file for analysis
- •User asks to "explore", "analyze", or "summarize" a data file
- •User wants to understand the structure and content of scientific data
- •User needs a comprehensive report of a dataset before analysis
- •User wants to assess data quality or completeness
- •User asks what type of analysis is appropriate for a file
Supported File Categories
The skill has comprehensive coverage of scientific file formats organized into six major categories:
1. Chemistry and Molecular Formats (60+ extensions)
Structure files, computational chemistry outputs, molecular dynamics trajectories, and chemical databases.
File types include: .pdb, .cif, .mol, .mol2, .sdf, .xyz, .smi, .gro, .log, .fchk, .cube, .dcd, .xtc, .trr, .prmtop, .psf, and more.
Reference file: references/chemistry_molecular_formats.md
2. Bioinformatics and Genomics Formats (50+ extensions)
Sequence data, alignments, annotations, variants, and expression data.
File types include: .fasta, .fastq, .sam, .bam, .vcf, .bed, .gff, .gtf, .bigwig, .h5ad, .loom, .counts, .mtx, and more.
Reference file: references/bioinformatics_genomics_formats.md
3. Microscopy and Imaging Formats (45+ extensions)
Microscopy images, medical imaging, whole slide imaging, and electron microscopy.
File types include: .tif, .nd2, .lif, .czi, .ims, .dcm, .nii, .mrc, .dm3, .vsi, .svs, .ome.tiff, and more.
Reference file: references/microscopy_imaging_formats.md
4. Spectroscopy and Analytical Chemistry Formats (35+ extensions)
NMR, mass spectrometry, IR/Raman, UV-Vis, X-ray, chromatography, and other analytical techniques.
File types include: .fid, .mzML, .mzXML, .raw, .mgf, .spc, .jdx, .xy, .cif (crystallography), .wdf, and more.
Reference file: references/spectroscopy_analytical_formats.md
5. Proteomics and Metabolomics Formats (30+ extensions)
Mass spec proteomics, metabolomics, lipidomics, and multi-omics data.
File types include: .mzML, .pepXML, .protXML, .mzid, .mzTab, .sky, .mgf, .msp, .h5ad, and more.
Reference file: references/proteomics_metabolomics_formats.md
6. General Scientific Data Formats (30+ extensions)
Arrays, tables, hierarchical data, compressed archives, and common scientific formats.
File types include: .npy, .npz, .csv, .xlsx, .json, .hdf5, .zarr, .parquet, .mat, .fits, .nc, .xml, and more.
Reference file: references/general_scientific_formats.md
Workflow
Step 1: File Type Detection
When a user provides a file path, first identify the file type:
- •Extract the file extension
- •Look up the extension in the appropriate reference file
- •Identify the file category and format description
- •Load format-specific information
Example:
User: "Analyze data.fastq"
→ Extension: .fastq
→ Category: bioinformatics_genomics
→ Format: FASTQ Format (sequence data with quality scores)
→ Reference: references/bioinformatics_genomics_formats.md
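The detection step can be sketched as a dictionary lookup. This is a minimal illustration, not the logic in scripts/eda_analyzer.py: the `EXTENSION_CATEGORIES` table and the `detect_category` helper are hypothetical, and the table holds only a handful of the extensions listed above. Some extensions (for example .mzML and .cif) appear in more than one category in this document, so a real lookup would need a tie-breaking rule.

```python
from pathlib import Path

# Hypothetical, heavily abbreviated table; the full mapping lives in the
# six reference files under references/.
EXTENSION_CATEGORIES = {
    ".fastq": "bioinformatics_genomics",
    ".pdb": "chemistry_molecular",
    ".nd2": "microscopy_imaging",
    ".ome.tiff": "microscopy_imaging",
    ".csv": "general_scientific",
}


def detect_category(filepath):
    """Return (extension, category); category is None for unknown extensions."""
    suffixes = [s.lower() for s in Path(filepath).suffixes]
    # Try the longest compound extension first (".ome.tiff" before ".tiff").
    for start in range(len(suffixes)):
        ext = "".join(suffixes[start:])
        if ext in EXTENSION_CATEGORIES:
            return ext, EXTENSION_CATEGORIES[ext]
    return (suffixes[-1] if suffixes else ""), None
```

Calling `detect_category("data.fastq")` returns the extension together with the category whose reference file should be read next. An unknown extension returns `None` as the category, which is the cue to fall back to the "Unknown File Types" guidance.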
Step 2: Load Format-Specific Information
Based on the file type, read the corresponding reference file to understand:
- •Typical Data: What kind of data this format contains
- •Use Cases: Common applications for this format
- •Python Libraries: How to read the file in Python
- •EDA Approach: What analyses are appropriate for this data type
Search the reference file for the specific extension (e.g., search for "### .fastq" in bioinformatics_genomics_formats.md).
Step 3: Perform Data Analysis
Use the scripts/eda_analyzer.py script OR implement custom analysis:
Option A: Use the analyzer script
# The script automatically:
# 1. Detects file type
# 2. Loads reference information
# 3. Performs format-specific analysis
# 4. Generates markdown report
python scripts/eda_analyzer.py <filepath> [output.md]
Option B: Custom analysis in the conversation
Based on the format information from the reference file, perform the appropriate analysis:
For tabular data (CSV, TSV, Excel):
- •Load with pandas
- •Check dimensions, data types
- •Analyze missing values
- •Calculate summary statistics
- •Identify outliers
- •Check for duplicates
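The tabular checks above might look like the following sketch. The helper name `summarize_table` is hypothetical, and the outlier rule (Tukey's 1.5 × IQR fences) is one common choice rather than something this skill prescribes. The sketch assumes the table has at least one numeric column.

```python
import pandas as pd


def summarize_table(df):
    """Collect basic EDA facts about a DataFrame into a plain dict."""
    numeric = df.select_dtypes(include="number")
    # Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outliers": outliers.to_dict(),
        "describe": numeric.describe(),
    }
```

A typical call is `summarize_table(pd.read_csv(path))`. Each key corresponds to one bullet in the list, so the result can be copied into the "Data Analysis" section of the report.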
For sequence data (FASTA, FASTQ):
- •Count sequences
- •Analyze length distributions
- •Calculate GC content
- •Assess quality scores (FASTQ)
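As a rough sketch of the sequence checks, the helpers below work on plain (sequence, quality string) pairs so they stay independent of any parser. The names `gc_fraction` and `summarize_reads` are hypothetical, and the code assumes Phred+33 quality encoding. Averaging Phred scores arithmetically is a crude summary, used here only to keep the example short.

```python
from statistics import mean


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def summarize_reads(records, phred_offset=33):
    """Summarize an iterable of (sequence, quality_string) pairs.

    Assumes Phred+33 encoding; pass an empty quality string for FASTA records.
    """
    lengths, gc_values, read_quals = [], [], []
    for seq, qual in records:
        lengths.append(len(seq))
        gc_values.append(gc_fraction(seq))
        if qual:
            # Mean Phred score of this read, decoded from its ASCII characters.
            read_quals.append(mean(ord(ch) - phred_offset for ch in qual))
    return {
        "read_count": len(lengths),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "mean_gc": mean(gc_values) if gc_values else 0.0,
        "mean_quality": mean(read_quals) if read_quals else None,
    }
```

With Biopython installed, the pairs could come from `SeqIO.parse`, for example by converting each record's sequence and quality values into this shape.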
For images (TIFF, ND2, CZI):
- •Check dimensions (X, Y, Z, C, T)
- •Analyze bit depth and value range
- •Extract metadata (channels, timestamps, spatial calibration)
- •Calculate intensity statistics
For arrays (NPY, HDF5):
- •Check shape and dimensions
- •Analyze data type
- •Calculate statistical summaries
- •Check for missing/invalid values
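For array data, a minimal sketch of these checks with NumPy could look like this. The helper `summarize_array` is hypothetical and only handles real-valued integer and floating-point arrays; other dtypes get shape and dtype information only.

```python
import numpy as np


def summarize_array(arr):
    """Report shape, dtype, and NaN-aware statistics for a NumPy array."""
    summary = {"shape": arr.shape, "ndim": arr.ndim, "dtype": str(arr.dtype)}
    is_float = np.issubdtype(arr.dtype, np.floating)
    is_int = np.issubdtype(arr.dtype, np.integer)
    if is_float:
        summary["nan_count"] = int(np.isnan(arr).sum())
        summary["inf_count"] = int(np.isinf(arr).sum())
    # Statistics use finite values only, so NaN/inf do not poison the result.
    values = arr[np.isfinite(arr)] if is_float else arr.ravel()
    if (is_float or is_int) and values.size:
        summary["min"] = float(values.min())
        summary["max"] = float(values.max())
        summary["mean"] = float(values.mean())
    return summary
```

Reporting the NaN and inf counts separately from the statistics makes it clear how much of the array the summary actually describes.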
Step 4: Generate Comprehensive Report
Create a markdown report with the following sections:
Required Sections:
- •Title and Metadata
  - •Filename and timestamp
  - •File size and location
- •Basic Information
  - •File properties
  - •Format identification
- •File Type Details
  - •Format description from reference
  - •Typical data content
  - •Common use cases
  - •Python libraries for reading
- •Data Analysis
  - •Structure and dimensions
  - •Statistical summaries
  - •Quality assessment
  - •Data characteristics
- •Key Findings
  - •Notable patterns
  - •Potential issues
  - •Quality metrics
- •Recommendations
  - •Preprocessing steps
  - •Appropriate analyses
  - •Tools and methods
  - •Visualization approaches
Template Location
Use assets/report_template.md as a guide for report structure.
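One way to assemble the required sections is sketched below. The layout is an assumption made for illustration: the contents of assets/report_template.md are not shown here, so the heading levels, the "Generated" line, and the `render_report` helper are all hypothetical.

```python
from datetime import datetime

# Section titles follow the "Required Sections" list; the title and
# metadata block is rendered separately as the report header.
REQUIRED_SECTIONS = [
    "Basic Information",
    "File Type Details",
    "Data Analysis",
    "Key Findings",
    "Recommendations",
]


def render_report(filename, sections, generated_at=None):
    """Render a markdown EDA report; missing sections get a placeholder."""
    generated_at = generated_at or datetime.now()
    lines = [
        f"# EDA Report: {filename}",
        "",
        f"Generated: {generated_at:%Y-%m-%d %H:%M:%S}",
        "",
    ]
    for title in REQUIRED_SECTIONS:
        lines.extend([f"## {title}", "", sections.get(title, "_Not assessed._"), ""])
    return "\n".join(lines)
```

Emitting a placeholder for every missing section keeps gaps visible in the report instead of silently dropping a required heading.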
Step 5: Save Report
Save the markdown report with a descriptive filename:
- •Pattern: {original_filename}_eda_report.md, where {original_filename} is the file name without its extension
- •Example: experiment_data.fastq → experiment_data_eda_report.md
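The naming pattern can be derived with pathlib, as in this small sketch. The helper name `report_path` is hypothetical, and placing the report next to the data file is an assumption. Only the final extension is dropped, so a compound name such as scan.ome.tiff would become scan.ome_eda_report.md.

```python
from pathlib import Path


def report_path(data_path):
    """Return the report path beside the data file, dropping the final extension."""
    path = Path(data_path)
    return path.with_name(f"{path.stem}_eda_report.md")
```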
Detailed Format References
Each reference file contains comprehensive information for dozens of file types. To find information about a specific format:
- •Identify the category from the extension
- •Read the appropriate reference file
- •Search for the section heading matching the extension (e.g., "### .pdb")
- •Extract the format information
Reference File Structure
Each format entry includes:
- •Description: What the format is
- •Typical Data: What it contains
- •Use Cases: Common applications
- •Python Libraries: How to read it (with code examples)
- •EDA Approach: Specific analyses to perform
Example lookup:
### .pdb - Protein Data Bank
**Description:** Standard format for 3D structures of biological macromolecules
**Typical Data:** Atomic coordinates, residue information, secondary structure
**Use Cases:** Protein structure analysis, molecular visualization, docking
**Python Libraries:**
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
**EDA Approach:**
- Structure validation (bond lengths, angles)
- B-factor distribution
- Missing residues detection
- Ramachandran plots
Best Practices
Reading Reference Files
Reference files are large (10,000+ words each). To efficiently use them:
- •Search by extension: Use a grep-style search to find the specific format, for example with Python's re module:
import re
with open('references/chemistry_molecular_formats.md', 'r') as f:
    content = f.read()
# Match from the "### .pdb" heading up to the next "### " heading or end of file
pattern = r'^### \.pdb\b.*?(?=^### |\Z)'
match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
section = match.group(0) if match else None
- •Extract relevant sections: Don't load entire reference files into context unnecessarily
- •Cache format info: If analyzing multiple files of the same type, reuse the format information
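Caching can be sketched with `functools.lru_cache`, so that each reference file is read once and each extension is searched once per session. The helper names `load_reference` and `format_section` are hypothetical, and the code assumes the "### .ext" heading convention described in this document.

```python
import re
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_reference(path):
    """Read a reference file once; later calls reuse the cached text."""
    return Path(path).read_text(encoding="utf-8")


@lru_cache(maxsize=None)
def format_section(path, extension):
    """Return the '### <extension>' section of a reference file, or None."""
    # The negative lookahead keeps ".fastq" from matching a ".fastq.gz" heading.
    pattern = rf"^### {re.escape(extension)}(?![\w.]).*?(?=^### |\Z)"
    flags = re.IGNORECASE | re.DOTALL | re.MULTILINE
    match = re.search(pattern, load_reference(path), flags)
    return match.group(0).strip() if match else None
```

Repeated calls such as `format_section("references/bioinformatics_genomics_formats.md", ".fastq")` then return the cached section without touching the file again.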
Data Analysis
- •Sample large files: For files with millions of records, analyze a representative sample
- •Handle errors gracefully: Many scientific formats require specific libraries; provide clear installation instructions
- •Validate metadata: Cross-check metadata consistency (e.g., stated dimensions vs actual data)
- •Consider data provenance: Note instrument, software versions, processing steps
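For the sampling advice, reservoir sampling is one way to draw a uniform random sample from a stream without knowing its length in advance. The sketch below is a standard textbook technique (Algorithm R), not something this skill's script is known to use, and the helper name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random from an iterable."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

Unlike taking the first N records, this avoids bias when a file is sorted or when quality drifts over the course of a run. The cost is one full pass over the file.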
Report Generation
- •Be comprehensive: Include all relevant information for downstream analysis
- •Be specific: Provide concrete recommendations based on the file type
- •Be actionable: Suggest specific next steps and tools
- •Include code examples: Show how to load and work with the data
Examples
Example 1: Analyzing a FASTQ file
# User provides: "Analyze reads.fastq"
# 1. Detect file type
extension = '.fastq'
category = 'bioinformatics_genomics'
# 2. Read reference info
# Search references/bioinformatics_genomics_formats.md for "### .fastq"
# 3. Perform analysis
from Bio import SeqIO
sequences = list(SeqIO.parse('reads.fastq', 'fastq'))
# Calculate: read count, length distribution, quality scores, GC content
# 4. Generate report
# Include: format description, analysis results, QC recommendations
# 5. Save as: reads_eda_report.md
Example 2: Analyzing a CSV dataset
# User provides: "Explore experiment_results.csv"
# 1. Detect: .csv → general_scientific
# 2. Load reference for CSV format
# 3. Analyze
import pandas as pd
df = pd.read_csv('experiment_results.csv')
# Dimensions, dtypes, missing values, statistics, correlations
# 4. Generate report with:
# - Data structure
# - Missing value patterns
# - Statistical summaries
# - Correlation matrix
# - Outlier detection results
# 5. Save report
Example 3: Analyzing microscopy data
# User provides: "Analyze cells.nd2"
# 1. Detect: .nd2 → microscopy_imaging (Nikon format)
# 2. Read reference for ND2 format
# Learn: multi-dimensional (XYZCT), requires nd2reader
# 3. Analyze
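from nd2reader import ND2Reader
with ND2Reader('cells.nd2') as images:
    # Extract: dimensions, channels, timepoints, metadata
    sizes = images.sizes        # axis sizes, e.g. x, y, c, t, z
    metadata = images.metadata  # channels, pixel size, acquisition info
    # Calculate: intensity statistics, frame info
    first_frame = images[0]
    print(sizes, first_frame.min(), first_frame.max(), first_frame.mean())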
# 4. Generate report with:
# - Image dimensions (XY, Z-stacks, time, channels)
# - Channel wavelengths
# - Pixel size and calibration
# - Recommendations for image analysis
# 5. Save report
Troubleshooting
Missing Libraries
Many scientific formats require specialized libraries:
Problem: Import error when trying to read a file
Solution: Provide clear installation instructions
try:
    from Bio import SeqIO
except ImportError:
    print("Install Biopython: uv pip install biopython")
Common requirements by category:
- •Bioinformatics: biopython, pysam, pyBigWig
- •Chemistry: rdkit, mdanalysis, cclib
- •Microscopy: tifffile, nd2reader, aicsimageio, pydicom
- •Spectroscopy: nmrglue, pymzml, pyteomics
- •General: pandas, numpy, h5py, scipy
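A requirements check along these lines can run before any analysis, so the user sees one clear message instead of a traceback. The helper `missing_packages` and the `IMPORT_NAMES` table are hypothetical. The table exists because some packages are imported under a different name than they are installed under, and it lists only the two cases from the list above that are known to differ.

```python
import importlib.util

# Install name -> import name, for packages where the two differ.
IMPORT_NAMES = {
    "biopython": "Bio",
    "mdanalysis": "MDAnalysis",
}


def missing_packages(packages):
    """Return the packages whose import name cannot be found."""
    missing = []
    for package in packages:
        module = IMPORT_NAMES.get(package, package)
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing
```

For example, `missing = missing_packages(["biopython", "pysam", "pyBigWig"])` followed by `print("Install with: uv pip install " + " ".join(missing))` gives the user a single copy-pasteable command.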
Unknown File Types
If a file extension is not in the references:
- •Ask the user about the file format
- •Check if it's a vendor-specific variant
- •Attempt generic analysis based on file structure (text vs binary)
- •Provide general recommendations
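The text-versus-binary check can be approximated by inspecting the first few kilobytes of the file. This is a heuristic sketch, not a reliable format detector: the helper name `sniff_file_kind` is hypothetical, and the rule (any NUL byte or invalid UTF-8 means binary) will misclassify text in encodings such as UTF-16.

```python
def sniff_file_kind(path, sample_size=8192):
    """Guess 'empty', 'text', or 'binary' from the first bytes of a file."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if not sample:
        return "empty"
    if b"\x00" in sample:
        return "binary"
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError as err:
        # A multi-byte character cut off at the sample boundary is not
        # evidence of binary data, so tolerate an error in the last 3 bytes.
        truncated = len(sample) == sample_size and err.start >= len(sample) - 3
        return "text" if truncated else "binary"
    return "text"
```

A "text" result suggests trying delimiter and header detection next. A "binary" result suggests checking for magic bytes or asking the user which instrument or software produced the file.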
Large Files
For very large files:
- •Use sampling strategies (first N records)
- •Use memory-mapped or lazy access (np.load with mmap_mode for NPY; h5py reads only the slices you index for HDF5)
- •Process in chunks (for CSV, FASTQ)
- •Provide estimates based on samples
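Chunked processing can be taken to its limit by streaming one row at a time and keeping only running totals, as in this standard-library sketch. The helper name `stream_column_stats` is hypothetical. It assumes the CSV has a header row and that blank or non-numeric cells should be counted and skipped rather than treated as errors.

```python
import csv


def stream_column_stats(handle, column):
    """Compute count, mean, min, and max of one CSV column in a single pass."""
    count, skipped, total = 0, 0, 0.0
    lowest, highest = float("inf"), float("-inf")
    for row in csv.DictReader(handle):
        try:
            value = float(row[column])
        except (TypeError, ValueError):
            skipped += 1  # blank, missing, or non-numeric cell
            continue
        count += 1
        total += value
        lowest, highest = min(lowest, value), max(highest, value)
    if count == 0:
        return {"count": 0, "skipped": skipped}
    return {
        "count": count,
        "skipped": skipped,
        "mean": total / count,
        "min": lowest,
        "max": highest,
    }
```

Memory use stays constant regardless of file size. With pandas, a similar effect comes from passing `chunksize` to `pd.read_csv` and combining per-chunk results.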
Script Usage
The scripts/eda_analyzer.py script can be run directly:
# Basic usage
python scripts/eda_analyzer.py data.csv
# Specify output file
python scripts/eda_analyzer.py data.csv output_report.md
# The script will:
# 1. Auto-detect file type
# 2. Load format references
# 3. Perform appropriate analysis
# 4. Generate markdown report
The script supports automatic analysis for many common formats, but custom analysis in the conversation provides more flexibility and domain-specific insights.
Advanced Usage
Multi-File Analysis
When analyzing multiple related files:
- •Perform individual EDA on each file
- •Create a summary comparison report
- •Identify relationships and dependencies
- •Suggest integration strategies
Quality Control
For data quality assessment:
- •Check format compliance
- •Validate metadata consistency
- •Assess completeness
- •Identify outliers and anomalies
- •Compare to expected ranges/distributions
Preprocessing Recommendations
Based on data characteristics, recommend:
- •Normalization strategies
- •Missing value imputation
- •Outlier handling
- •Batch correction
- •Format conversions
Resources
scripts/
- •eda_analyzer.py: Comprehensive analysis script that can be run directly or imported
references/
- •chemistry_molecular_formats.md: 60+ chemistry/molecular file formats
- •bioinformatics_genomics_formats.md: 50+ bioinformatics formats
- •microscopy_imaging_formats.md: 45+ imaging formats
- •spectroscopy_analytical_formats.md: 35+ spectroscopy formats
- •proteomics_metabolomics_formats.md: 30+ omics formats
- •general_scientific_formats.md: 30+ general formats
assets/
- •report_template.md: Comprehensive markdown template for EDA reports