Data Research Protocol
Principle: DATA FIRST, CODE SECOND.
Workflow
- LOAD -- Load data, verify accessibility
- SCHEMA -- Show structure (types, shape, samples)
- PROFILE -- Find risks (nulls, duplicates, anomalies)
- HYPOTHESIS -- What do we want to prove?
- EXPERIMENT -- One small test
- DOCUMENT -- Record findings per 5W+H format
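The DOCUMENT step can be made concrete with a small record type. The sketch below is one possible shape, not a prescribed format: the `Finding` class, its mapping of the six questions to fields, and the sample values are all assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class Finding:
    """One documented finding; the field-to-question mapping is an assumption."""
    who: str    # who ran the analysis
    what: str   # what was observed
    when: date  # when the analysis was run
    where: str  # which dataset, table, or column
    why: str    # why it matters (the hypothesis it bears on)
    how: str    # how it was measured (the check or query used)

    def to_log_line(self) -> str:
        # Fields are emitted in declaration order, so log lines stay comparable.
        return " | ".join(
            f"{key.upper()}: {value}" for key, value in asdict(self).items()
        )


# Made-up values, only to show the output format.
finding = Finding(
    who="analyst",
    what="12 duplicate rows",
    when=date(2024, 1, 15),
    where="orders.csv",
    why="duplicates would inflate revenue totals",
    how="df.duplicated().sum()",
)
print(finding.to_log_line())
```

Freezing the dataclass keeps a recorded finding from being edited after the fact; a correction becomes a new record.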
Schema Analysis (MANDATORY before any conclusions)
```python
# Assumes df is a pandas DataFrame already loaded in the LOAD step.
print(f"Shape: {df.shape}")
print(f"dtypes:\n{df.dtypes}")
print(f"head:\n{df.head()}")
print(f"nunique:\n{df.nunique()}")
print(f"nulls:\n{df.isnull().sum()}")
```
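The same schema checks can be collected into one table with a row per column, which is easier to scan on wide datasets. This is a minimal sketch: `schema_report` is a hypothetical helper name, the `null_pct` column is an addition, and the sample frame holds made-up values.

```python
import pandas as pd


def schema_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, distinct values, null count, null share."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nunique": df.nunique(),  # nulls are not counted as a distinct value
        "nulls": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(1),
    })


# Toy frame with made-up values, only to show the report's layout.
sample = pd.DataFrame({"id": [1, 2, 3, 4], "city": ["Oslo", None, "Oslo", "Rome"]})
report = schema_report(sample)
print(f"Shape: {sample.shape}")
print(report)
```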
Risk Profiling
| Risk | Check | Action |
|---|---|---|
| Missing data | df.isnull().sum() | Document, decide handling |
| Duplicates | df.duplicated().sum() | Investigate |
| Wrong types | df.dtypes, manual inspection | Convert types |
| Outliers | df.describe() | Investigate |
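The checks in the table can be run in one pass. The sketch below rests on assumptions: `profile_risks` is a hypothetical helper, the wrong-type check only catches numbers stored as text, and outliers are counted with the 1.5 x IQR rule rather than read off `df.describe()` by eye. The sample frame is made up.

```python
import pandas as pd


def profile_risks(df: pd.DataFrame) -> dict:
    """Collect the four risk checks into one dictionary."""
    risks = {
        "missing": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "numeric_as_text": [],
        "outliers": {},
    }
    # Wrong types: text columns whose non-null values all parse as numbers.
    text_cols = df.select_dtypes(exclude=["number", "bool", "datetime", "category"])
    for col in text_cols:
        values = df[col].dropna()
        if len(values) and pd.to_numeric(values, errors="coerce").notna().all():
            risks["numeric_as_text"].append(col)
    # Outliers: values beyond 1.5 * IQR from the quartiles (an assumed rule).
    for col in df.select_dtypes(include="number"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        fence = 1.5 * (q3 - q1)
        mask = (df[col] < q1 - fence) | (df[col] > q3 + fence)
        risks["outliers"][col] = int(mask.sum())
    return risks


# Toy frame with made-up values: one duplicate row, one null, one extreme amount.
sample = pd.DataFrame({
    "amount": [10, 12, 11, 13, 12, 500, 10],
    "qty": ["1", "2", "2", "3", "1", "4", "1"],
    "note": ["a", None, "b", "c", "d", "e", "a"],
})
risks = profile_risks(sample)
for name, value in risks.items():
    print(f"{name}: {value}")
```

Each flagged item is a prompt to investigate, as the Action column says, not a decision to drop or convert anything automatically.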
Mini-Experiment Protocol
```python
# EXPERIMENT: [Description]
# HYPOTHESIS: [What we expect]
expected = 0  # placeholder: set the expected count BEFORE running the test
result = df[df['column'] == 'value'].shape[0]
print(f"Result: {result}")
print(f"Expected: {expected}")
print(f"Status: {'PASS' if result == expected else 'FAIL'}")
```
Rules:
- One question per experiment
- Fast (< 30 seconds)
- Logged (print results)
- Compared with expectation
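The four rules above can be enforced by a small wrapper, sketched here under stated assumptions: `run_experiment` and its parameters are hypothetical names, the time limit is only measured and reported (the test is not interrupted), and the sample rows are made up.

```python
import time
from typing import Any, Callable


def run_experiment(description: str, hypothesis: str,
                   test: Callable[[], Any], expected: Any,
                   time_limit_s: float = 30.0) -> dict:
    """Run one test, log the outcome, and compare it with the expectation."""
    start = time.perf_counter()
    result = test()  # one callable means one question per experiment
    elapsed = time.perf_counter() - start
    record = {
        "description": description,
        "hypothesis": hypothesis,
        "result": result,
        "expected": expected,
        "status": "PASS" if result == expected else "FAIL",
        "elapsed_s": elapsed,
        "within_time_limit": elapsed < time_limit_s,
    }
    # Logged: every field is printed, including the FAIL case.
    for key, value in record.items():
        print(f"{key}: {value}")
    return record


# Made-up rows standing in for a real dataset.
rows = [{"status": "paid"}, {"status": "open"}, {"status": "paid"}]
record = run_experiment(
    description="Count paid rows",
    hypothesis="Exactly two rows are paid",
    test=lambda: sum(row["status"] == "paid" for row in rows),
    expected=2,
)
```

Passing `expected` as an argument forces the expectation to be written down before the test runs, which is the point of the comparison rule.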
Cognitive Bias Prevention
- Do NOT analyze only the first N records (selection bias)
- Do NOT look only for confirmations (confirmation bias)
- Analyze ALL data
- Actively look for DISPROOF of the hypothesis
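A short illustration of why both points matter, using made-up data in which the violations sit late in the file. The hypothesis and column name are assumptions chosen for the example.

```python
import pandas as pd

# Toy data (made up): negative amounts appear only after the first five rows.
df = pd.DataFrame({"amount": [5, 8, 3, 9, 4, 7, -2, 6, -1, 10]})

# Hypothesis: "amount is never negative".
# Looking for disproof means counting violations, not confirming rows.
violations_in_head = int((df.head(5)["amount"] < 0).sum())
violations_in_all = int((df["amount"] < 0).sum())

print(f"Violations in first 5 rows: {violations_in_head}")
print(f"Violations in all {len(df)} rows: {violations_in_all}")
print(f"Status: {'PASS' if violations_in_all == 0 else 'FAIL'}")
```

Checking only `df.head(5)` would have confirmed the hypothesis; the full scan disproves it.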