Data Engineering Skill
Quick Reference
| Role | Focus | Timeline | Entry From |
|---|---|---|---|
| Data Engineer | Pipelines, Infra | 12-24 mo | Backend Dev |
| ML Engineer | Models, Features | 12-24 mo | Data Scientist |
| AI Engineer | LLMs, Agents | 6-12 mo | Any Developer |
Learning Paths
Data Engineer
[1] SQL Mastery (4-6 wk)
│ └─ Window functions, CTEs, optimization
│
▼
[2] Python for Data (4-6 wk)
│ └─ Pandas, file formats, scripting
│
▼
[3] ETL/ELT Pipelines (6-8 wk)
│ └─ Extract, transform, load patterns
│
▼
[4] Big Data: Spark (8-12 wk)
│ └─ PySpark, DataFrames, partitioning
│
▼
[5] Data Warehouse (4-6 wk)
│ └─ Star schema, dbt, Snowflake/BQ
│
▼
[6] Orchestration (4-6 wk)
└─ Airflow/Prefect, scheduling, monitoring
2025 Stack: Python + Spark + Airflow + dbt + Snowflake/BigQuery
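The extract, transform, load pattern from step 3 and the CTE from step 1 can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the CSV content, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for a real warehouse such as Snowflake or BigQuery.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
4,carol,7.00
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast column types."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        clean.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return clean


def load(conn, records):
    """Load: write the cleaned records into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
load(conn, transform(extract(RAW_CSV)))

# A CTE that aggregates spend per customer, run after the load step.
totals = conn.execute(
    """
    WITH customer_totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total FROM customer_totals ORDER BY total DESC
    """
).fetchall()
```

Keeping the three stages as separate functions makes each one testable on its own, which is also roughly how an orchestrator task graph would split the work.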
ML Engineer
[1] Python + NumPy (4-6 wk)
│
▼
[2] Math Foundations (6-8 wk)
│ └─ Linear algebra, calculus, statistics
│
▼
[3] Classical ML (8-12 wk)
│ └─ scikit-learn, XGBoost, evaluation
│
▼
[4] Deep Learning (8-12 wk)
│ └─ PyTorch, CNNs, Transformers
│
▼
[5] MLOps (6-8 wk)
└─ MLflow, model serving, monitoring
2025 Stack: Python + PyTorch + scikit-learn + MLflow + W&B
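The "evaluation" item in step 3 usually means looking beyond raw accuracy. As a rough sketch, precision, recall, and F1 for a binary classifier can be computed by hand as below; the function name `precision_recall_f1` and the sample labels are illustrative assumptions, and in practice a library such as scikit-learn would normally do this for you.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here the model makes one false positive and one false negative against three true positives, so all three metrics come out at 0.75.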
AI Engineer (2025 Hot Path)
[1] LLM Fundamentals (2-3 wk)
│ └─ Tokens, embeddings, context windows
│
▼
[2] Prompt Engineering (2-3 wk)
│ └─ Few-shot, CoT, structured output
│
▼
[3] RAG Systems (3-4 wk)
│ └─ Embeddings, vector DBs, retrieval
│
▼
[4] AI Agents (4-6 wk)
│ └─ Tool calling, agent loops, memory
│
▼
[5] Production Deploy (ongoing)
└─ Evaluation, guardrails, monitoring
2025 Stack: Python + LangChain/LlamaIndex + OpenAI/Anthropic + ChromaDB
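The retrieval step of a RAG system (step 3) can be illustrated without any external service. The sketch below is a toy: `embed` is a bag-of-words counter standing in for a real embedding model, the in-memory `DOCS` list stands in for a vector DB such as ChromaDB, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts, a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, docs, k=2):
    """Assemble retrieved context and the question into a single prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical document store.
DOCS = [
    "airflow schedules and monitors data pipelines",
    "dbt runs sql transformations in the warehouse",
    "vector databases store embeddings for retrieval",
]

top = retrieve("how do I store embeddings", DOCS, k=1)
prompt = build_prompt("how do I store embeddings", DOCS, k=1)
```

The resulting prompt would then be sent to an LLM API; that call is omitted here because it depends on the provider you choose.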
2025 Tool Matrix
Data Processing
| Tool | Scale | Use Case |
|---|---|---|
| Pandas | <10GB | Prototyping, small data |
| Polars | <100GB | Fast local processing |
| Spark | >100GB | Distributed processing |
| dbt | Any | Transformations, testing |
ML Frameworks
| Framework | Best For | Complexity |
|---|---|---|
| scikit-learn | Classical ML | Low |
| XGBoost | Tabular data | Low |
| PyTorch | Research, flexibility | Medium |
| TensorFlow | Production, mobile | Medium |
LLM/AI Tools
| Tool | Use Case |
|---|---|
| LangChain | LLM orchestration |
| LlamaIndex | RAG systems |
| Claude/OpenAI | LLM APIs |
| ChromaDB | Vector storage |
Algorithm Reference
Classical ML
| Type | Algorithms |
|---|---|
| Regression | Linear, Ridge, Lasso, ElasticNet |
| Classification | Logistic, SVM, Decision Tree |
| Ensemble | Random Forest, XGBoost, LightGBM |
| Clustering | K-Means, DBSCAN, Hierarchical |
Deep Learning
| Architecture | Use Case |
|---|---|
| CNN | Images, vision |
| RNN/LSTM | Sequences |
| Transformer | NLP, LLMs |
| Diffusion | Image generation |
AI Agent Architecture (2025)
┌─────────────────────────────────────────┐
│              AGENTIC LOOP               │
├─────────────────────────────────────────┤
│  PERCEIVE → REASON → ACT → REFLECT      │
│     │         │       │       │         │
│     │         │       │       └─► Loop  │
│     │         │       └─► Execute tools │
│     │         └─► LLM decides action    │
│     └─► Gather context, observations    │
└─────────────────────────────────────────┘

Design Patterns (Anthropic 2025):
• Prompt Chaining - Sequential fixed steps
• Routing - Classify and dispatch
• Parallelization - Concurrent subtasks
• Orchestrator-Workers - Central delegation
• Evaluator-Optimizer - Generate + critique
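The perceive, reason, act, reflect cycle above can be sketched as a bounded loop. Everything named here is an assumption made for illustration: `fake_llm` stands in for a real LLM call, `calculator` is a hypothetical tool, and the action dictionaries are an invented format rather than any provider's tool-calling schema.

```python
def calculator(expression):
    """Hypothetical tool: sum whitespace-separated numbers, e.g. '2 3' -> 5.0."""
    return sum(float(x) for x in expression.split())


TOOLS = {"calculator": calculator}


def fake_llm(goal, observations):
    """Stand-in for an LLM call: returns the next action as a dict."""
    if not observations:
        return {"type": "tool", "name": "calculator", "input": goal}
    return {"type": "finish", "answer": observations[-1]}


def run_agent(goal, decide=fake_llm, max_iterations=5):
    """Run the loop until the model finishes or the iteration budget runs out."""
    observations = []  # PERCEIVE: context gathered so far
    for _ in range(max_iterations):  # hard cap so the loop always exits
        action = decide(goal, observations)  # REASON: the model picks an action
        if action["type"] == "finish":
            return action["answer"]
        result = TOOLS[action["name"]](action["input"])  # ACT: execute the tool
        observations.append(result)  # REFLECT: feed the result back in
    return None  # budget exhausted without a final answer


answer = run_agent("2 3")
```

The `max_iterations` cap is the important design choice: a model that never emits a finish action still terminates, and the caller can treat `None` as a signal to fall back or escalate.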
Troubleshooting
Which path to choose?
├─► Love building infrastructure? → Data Engineer
├─► Love algorithms/math? → ML Engineer
├─► Want fastest AI entry? → AI Engineer
└─► Uncertain? → Start with Python + SQL

Model not performing well?
├─► Data quality issues? → Clean data first
├─► Feature engineering? → Create better features
├─► Wrong algorithm? → Try different models
├─► Overfitting? → More data, regularization
└─► Hyperparameters? → Grid/random search

LLM giving bad answers?
├─► Prompt too vague? → Be more specific
├─► Missing context? → Add relevant info
├─► Hallucinating? → Use RAG, verify facts
└─► Wrong tool? → Improve tool descriptions
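For the hyperparameter branch, a grid search simply scores every combination of candidate values and keeps the best. The sketch below assumes a hypothetical `toy_score` function in place of a real train-and-validate step; `grid_search` is an illustrative helper, not a library API.

```python
import itertools


def grid_search(score_fn, grid):
    """Try every combination in `grid` and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(learning_rate, depth):
    """Hypothetical validation score that peaks at learning_rate=0.1, depth=4."""
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2


best_params, best_score = grid_search(
    toy_score,
    {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]},
)
```

The number of combinations grows multiplicatively with each added parameter, which is why random search over the same ranges is often preferred once the grid gets large.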
Common Failure Modes
| Symptom | Root Cause | Recovery |
|---|---|---|
| Model fails in prod | Data drift | Monitor distributions |
| Pipeline always late | Unoptimized queries | Profile, partition |
| RAG finds wrong docs | Bad chunking | Tune chunk size, overlap |
| Agent loops forever | No exit condition | Add max iterations |
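The "tune chunk size, overlap" recovery for RAG can be made concrete with a small word-based splitter. This is a sketch under simple assumptions: `chunk_words` is a hypothetical helper, it splits on whitespace rather than tokens or sentences, and the default sizes are placeholders rather than recommendations.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word chunks where neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each chunk repeats the last word of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.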
Portfolio Projects
Data Engineering
- ETL Pipeline (Airflow + dbt)
- Real-time Streaming (Kafka + Spark)
- Data Warehouse Design
ML Engineering
- Classification Model (scikit-learn)
- Deep Learning Model (PyTorch)
- ML Pipeline (MLflow)
AI Engineering
- RAG Chatbot (LangChain + ChromaDB)
- AI Agent with Tools
- Multi-Agent System
Next Actions
Specify your target role for a detailed learning plan.