Data Analysis & Science Skills
1. Pandas Best Practices
- Vectorization: Avoid iterating over rows (`iterrows`). Use vectorized operations for simulation data processing.
  - Bad: `[row['a'] + row['b'] for _, row in df.iterrows()]`
  - Good: `df['a'] + df['b']`
- Missing Values: Explicitly handle `NaN`. Simulation data must be clean.
  - Use `fillna()` or `dropna()` with valid reasoning documented.
- Loading Data:
  - Use `pathlib` for paths.
  - Load constants from `utils/constants/limits.py` if filtering during load.
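
The three practices above can be sketched together as follows. This is a minimal illustration, not the project's actual loader: the column names (`stay_hours`, `cost`), the helper names, and the `MAX_STAY_HOURS` threshold are hypothetical, and in the real code the threshold would be imported from `utils/constants/limits.py` rather than defined inline.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical limit; the real value would live in utils/constants/limits.py.
MAX_STAY_HOURS = 1000.0


def load_runs(csv_path: Path) -> pd.DataFrame:
    """Load simulation output from a pathlib path and filter during load."""
    df = pd.read_csv(csv_path)
    # Drop only rows that exceed the limit. NaN rows are kept on purpose
    # so that missing values are handled explicitly in a later step.
    within_limit = ~(df["stay_hours"] > MAX_STAY_HOURS)
    return df[within_limit].reset_index(drop=True)


def clean_and_derive(df: pd.DataFrame) -> pd.DataFrame:
    """Handle NaN explicitly, then compute a derived column without iterrows()."""
    # Reason: a run without a recorded stay cannot be analysed, so drop it.
    out = df.dropna(subset=["stay_hours"]).copy()
    # Assumption for this sketch: a missing cost means no cost was incurred.
    out["cost"] = out["cost"].fillna(0.0)
    # Vectorized column arithmetic replaces a row-by-row loop.
    out["cost_per_hour"] = out["cost"] / out["stay_hours"]
    return out


example = pd.DataFrame(
    {"stay_hours": [10.0, np.nan, 20.0], "cost": [100.0, 50.0, np.nan]}
)
cleaned = clean_and_derive(example)
```

A call such as `load_runs(Path("data") / "runs.csv")` (a hypothetical location) would feed directly into `clean_and_derive`. The comments next to `dropna` and `fillna` are where the reasoning for each choice gets documented.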
2. Statistics (SciPy & statsmodels)
- Comparisons: Use `uci.stats` module.
  - `Friedman` test for multiple group comparisons.
  - `Wilcoxon` for paired comparisons.
- Distributions: Refer to `uci.distribuciones` for random variable generation (e.g., `norm`, `expon`).
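
The interfaces of `uci.stats` and `uci.distribuciones` are not shown here, so the sketch below calls SciPy directly to illustrate the underlying tests and distributions. Treat it as an assumption about what those project modules wrap, not as their API. The policy names, sample size, and distribution parameters are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_runs = 30

# Random variates from SciPy's norm and expon distributions (hypothetical parameters).
base_stay = stats.expon(scale=48.0).rvs(size=n_runs, random_state=rng)


def noise() -> np.ndarray:
    """Independent measurement noise for one policy."""
    return stats.norm(loc=0.0, scale=5.0).rvs(size=n_runs, random_state=rng)


# Three hypothetical policies measured on the same replications (paired data).
policy_a = base_stay + noise()
policy_b = base_stay + 2.0 + noise()
policy_c = base_stay + 4.0 + noise()

# Friedman test: three or more related samples.
friedman_stat, friedman_p = stats.friedmanchisquare(policy_a, policy_b, policy_c)

# Wilcoxon signed-rank test: two paired samples.
wilcoxon_stat, wilcoxon_p = stats.wilcoxon(policy_a, policy_b)

print(f"Friedman p-value: {friedman_p:.4f}")
print(f"Wilcoxon p-value: {wilcoxon_p:.4f}")
```

Both tests assume the samples are matched, which is why every policy here is evaluated on the same set of replications.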
3. Machine Learning (Scikit-Learn)
- Pipelines: Use `sklearn.pipeline.Pipeline` for preprocessing + modeling.
- Persistence: Save/Load models using `joblib` in `models/` directory.
- Reproducibility: Always set `random_state` (or seed) for stochastic models.
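
A minimal sketch of the three points above, assuming a random forest as the stochastic model. The estimator choice, the synthetic data, the file name, and the helper names (`build_pipeline`, `save_model`, `load_model`) are illustrative and not taken from the project.

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42


def build_pipeline(seed: int = SEED) -> Pipeline:
    """Preprocessing and model in one object, with a fixed random_state."""
    return Pipeline(
        [
            ("scale", StandardScaler()),
            ("clf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ]
    )


def save_model(model: Pipeline, models_dir: Path, name: str = "model.joblib") -> Path:
    """Persist a fitted pipeline under the given models directory."""
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / name
    joblib.dump(model, path)
    return path


def load_model(path: Path) -> Pipeline:
    """Load a pipeline previously saved with save_model."""
    return joblib.load(path)


# Synthetic data standing in for simulation features and a binary outcome.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = build_pipeline().fit(X, y)
```

Calling `save_model(model, Path("models"))` writes the whole pipeline, so the scaler fitted on the training data travels with the classifier when it is loaded again.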
4. Simulation Data (SimPy Integration)
- Output: Simulation runs should generate consistent DataFrames.
- Structure:
  - Rows: Individual simulation events or patient runs.
  - Columns: Metrics (time in UCI, survival status, cost).
- Aggregations: Calculate aggregate stats (mean, std, percentiles) after collecting full simulation batch results.
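
The layout described above could look like the sketch below. `run_replication` is a stand-in that draws random numbers instead of running a SimPy model, and the column names, distributions, and parameters are hypothetical. The point is the shape: one row per patient run, one column per metric, and aggregation only after the full batch is collected.

```python
import numpy as np
import pandas as pd

METRICS = ["uci_hours", "cost"]


def run_replication(run_id: int, n_patients: int, rng: np.random.Generator) -> pd.DataFrame:
    """Stand-in for one simulation run; every run returns the same columns."""
    return pd.DataFrame(
        {
            "run_id": run_id,
            "patient_id": np.arange(n_patients),
            "uci_hours": rng.exponential(scale=72.0, size=n_patients),
            "survived": rng.random(n_patients) < 0.8,
            "cost": rng.normal(loc=5000.0, scale=800.0, size=n_patients),
        }
    )


def run_batch(n_runs: int, n_patients: int, seed: int = 42) -> pd.DataFrame:
    """Collect all replications into a single DataFrame before any aggregation."""
    rng = np.random.default_rng(seed)
    frames = [run_replication(i, n_patients, rng) for i in range(n_runs)]
    return pd.concat(frames, ignore_index=True)


def summarize(batch: pd.DataFrame) -> pd.DataFrame:
    """Mean, std, and percentiles of the numeric metrics over the full batch."""
    moments = batch[METRICS].agg(["mean", "std"])
    percentiles = batch[METRICS].quantile([0.05, 0.5, 0.95])
    percentiles.index = ["p05", "p50", "p95"]
    return pd.concat([moments, percentiles])


batch = run_batch(n_runs=5, n_patients=20)
summary = summarize(batch)
print(summary)
```

Because every replication returns the same columns, `pd.concat` needs no alignment work, and `batch.groupby("run_id")` remains available if per-run statistics are needed later.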