Fabricate Trading Day
Generate /db23/parsed_excel_files/{day}.pickle from incomplete CSV whale data.
Quick Start
python scripts/generate_pickle.py \
  --csv /path/to/whale_data.csv \
  --day 2026_01_06
Output: /db23/parsed_excel_files/2026_01_06.pickle (30-40K rows, balanced, ready for pipeline)
What It Does
Before market open, only a partial CSV (~300 whale transactions) is available. This skill builds a complete dataset (30-40K rows) for dashboards by:
- Loading CSV whale data (real transactions)
- Adding fake players from a reusable pool to balance the books
- Creating PT transactions (3 pairs per stock, all fake)
- Padding to 30-40K rows
- Verifying all requirements
- Saving the pickle file
Your deliverable: ONE pickle file
User handles: Pipeline steps 2-6 to process your pickle
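The balancing step can be pictured with a minimal sketch. It assumes rows are plain dictionaries with `stk`, `name`, `buy`, and `sell` keys, and that a single fake player per stock absorbs the whole imbalance. The function name `balancing_rows` and the fake player names are hypothetical, and the real script may spread the imbalance across several players from the pool.

```python
from collections import defaultdict

def balancing_rows(whale_rows, fake_names):
    """Return fake rows that make sum(buy) == sum(sell) for every stock."""
    net = defaultdict(int)  # stock -> sum(buy) - sum(sell)
    for row in whale_rows:
        net[row["stk"]] += row["buy"] - row["sell"]

    fake_rows = []
    for i, (stk, imbalance) in enumerate(sorted(net.items())):
        if imbalance == 0:
            continue  # already balanced, nothing to add
        # Excess buying is offset by a fake seller, and vice versa.
        fake_rows.append({
            "stk": stk,
            "name": fake_names[i % len(fake_names)],  # hypothetical pool names
            "buy": max(-imbalance, 0),
            "sell": max(imbalance, 0),
        })
    return fake_rows

whales = [
    {"stk": "AAA", "name": "W1", "buy": 5000, "sell": 0},
    {"stk": "AAA", "name": "W2", "buy": 0, "sell": 2000},
    {"stk": "BBB", "name": "W3", "buy": 0, "sell": 700},
]
fakes = balancing_rows(whales, ["FAKE_1", "FAKE_2"])
```

Here `AAA` has 3000 more shares bought than sold, so the sketch adds a fake seller of 3000. `BBB` gets a fake buyer of 700.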
CSV Format
Required columns: Stock, Account, Name, Buy Order, Buy, Sell Order, Sell, Date
Number format: Plain integers/floats - NO period as thousands separator
- ✓ Correct: `670000` or `670000.0`
- ✗ Wrong: `670.000` (Vietnamese format)
Empty cells: Treated as 0
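A minimal sketch of reading a CSV in this format with the standard library. The helper names `parse_volume` and `load_whale_rows`, the sample row, and its date format are illustrative assumptions. The actual script may use a different reader.

```python
import csv
import io

VOLUME_COLUMNS = ["Buy Order", "Buy", "Sell Order", "Sell"]

def parse_volume(cell):
    """Parse a plain integer/float cell; empty cells count as 0."""
    cell = (cell or "").strip()
    if cell == "":
        return 0
    return int(float(cell))

def load_whale_rows(handle):
    """Read all records, converting the four volume columns to int."""
    rows = []
    for record in csv.DictReader(handle):
        for column in VOLUME_COLUMNS:
            record[column] = parse_volume(record[column])
        rows.append(record)
    return rows

# Made-up sample row; the date format here is an assumption.
sample = io.StringIO(
    "Stock,Account,Name,Buy Order,Buy,Sell Order,Sell,Date\n"
    "AAA,001,Whale A,670000,670000.0,,,2026-01-06\n"
)
rows = load_whale_rows(sample)
```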
Script Options
- `--csv`: Input CSV path (required)
- `--day`: Day string YYYY_MM_DD (required)
- `--fake-pool`: Fake pool path (default: `/db23/parsed_excel_files/fake_player_pool.pickle`)
- `--output`: Output path (default: `/db23/parsed_excel_files/{day}.pickle`)
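For orientation, these flags could be declared with `argparse` roughly as follows. This is a sketch under the assumption that the `--output` default is derived from `--day`. It is not the source of `scripts/generate_pickle.py`, and the constant and function names are hypothetical.

```python
import argparse

POOL_DEFAULT = "/db23/parsed_excel_files/fake_player_pool.pickle"
OUTPUT_TEMPLATE = "/db23/parsed_excel_files/{day}.pickle"

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate a trading-day pickle")
    parser.add_argument("--csv", required=True, help="Input CSV path")
    parser.add_argument("--day", required=True, help="Day string YYYY_MM_DD")
    parser.add_argument("--fake-pool", default=POOL_DEFAULT, help="Fake pool path")
    parser.add_argument("--output", default=None, help="Output path")
    args = parser.parse_args(argv)
    if args.output is None:
        # Fill in the day-based default only when --output was not given.
        args.output = OUTPUT_TEMPLATE.format(day=args.day)
    return args

args = parse_args(["--csv", "/path/to/whale_data.csv", "--day", "2026_01_06"])
```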
Data Requirements
The script enforces these requirements:
- Row count: 30,000-40,000
- Balance: `sum(buy) == sum(sell)` per stock AND total
- PT: 3 pairs per stock (6 rows per stock), all fake, 1000 shares each
- Types: int32 for volumes, int64 for is_pt, float64 for price
- No NaN in: stk, name, id, address
- Order >= matched: buy_order >= buy, sell_order >= sell
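Several of these requirements can be expressed as simple checks. The sketch below works on plain dictionaries rather than the DataFrame the real script presumably uses, so it omits the dtype checks. It also does not check that PT rows are fake or carry 1000 shares. `check_requirements` and `make_row` are hypothetical names, and the row bounds are parameters so the demo can run on a handful of rows.

```python
import math
from collections import defaultdict

REQUIRED_FIELDS = ("stk", "name", "id", "address")

def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))

def check_requirements(rows, min_rows=30_000, max_rows=40_000):
    """Return a list of failure messages; an empty list means all checks pass."""
    failures = []
    if not min_rows <= len(rows) <= max_rows:
        failures.append(f"row count {len(rows)} outside {min_rows}-{max_rows}")

    net = defaultdict(int)      # stock -> sum(buy) - sum(sell)
    pt_rows = defaultdict(int)  # stock -> number of PT rows
    for row in rows:
        if any(is_missing(row[field]) for field in REQUIRED_FIELDS):
            failures.append(f"missing critical field in {row}")
            continue
        if row["buy_order"] < row["buy"] or row["sell_order"] < row["sell"]:
            failures.append(f"order below matched volume for {row['name']}")
        net[row["stk"]] += row["buy"] - row["sell"]
        pt_rows[row["stk"]] += row["is_pt"]

    # Per-stock balance implies total balance, so only stocks are checked.
    for stk, imbalance in net.items():
        if imbalance != 0:
            failures.append(f"{stk} unbalanced by {imbalance}")
        if pt_rows[stk] != 6:
            failures.append(f"{stk} has {pt_rows[stk]} PT rows, expected 6")
    return failures

def make_row(stk, name, buy, sell, is_pt):
    # Demo helper: id/address values are placeholders, not real data.
    return {"stk": stk, "name": name, "id": name, "address": "n/a",
            "buy": buy, "sell": sell, "buy_order": buy, "sell_order": sell,
            "is_pt": is_pt}

demo = []
for pair in range(3):  # 3 PT pairs = 6 PT rows
    demo.append(make_row("AAA", f"PT_BUY_{pair}", 1000, 0, 1))
    demo.append(make_row("AAA", f"PT_SELL_{pair}", 0, 1000, 1))
demo.append(make_row("AAA", "Whale A", 5000, 0, 0))
demo.append(make_row("AAA", "FAKE_1", 0, 5000, 0))
demo_failures = check_requirements(demo, min_rows=1, max_rows=100)
```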
Verification
Script automatically verifies and reports:
- Whale sums match CSV input exactly
- Each stock balanced individually
- Total balanced
- Row count in 30-40K range
- No NaN in critical fields
- Correct data types
All checks must pass before pickle is saved.
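The "verify, then save" gate might look like the sketch below. It assumes each check is a function that returns a list of failure messages. `save_if_valid` and `check_total_balance` are hypothetical names, and the demo pickles a list of dictionaries where the real script presumably pickles a DataFrame.

```python
import os
import pickle
import tempfile

def check_total_balance(rows):
    total = sum(row["buy"] - row["sell"] for row in rows)
    return [] if total == 0 else [f"total unbalanced by {total}"]

def save_if_valid(rows, output_path, checks):
    """Run every check; write the pickle only when none of them fails."""
    failures = [message for check in checks for message in check(rows)]
    if failures:
        # Nothing is written when any check fails.
        raise ValueError("verification failed: " + "; ".join(failures))
    with open(output_path, "wb") as handle:
        pickle.dump(rows, handle)
    return output_path

good = [{"buy": 1000, "sell": 0}, {"buy": 0, "sell": 1000}]
with tempfile.TemporaryDirectory() as tmp:
    path = save_if_valid(good, os.path.join(tmp, "2026_01_06.pickle"),
                         [check_total_balance])
    saved = os.path.exists(path)
```

Raising before the file is opened means a failed run cannot leave a partial or invalid pickle behind for the later pipeline steps.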
Critical: NO Scaling
The script parses CSV values as-is. Numbers are already correct.
Previous bug to avoid:
- Old scripts used `.replace('.', '')` for Vietnamese format (`2.466.500`)
- When CSV has float format (`2466500.0`), this creates a 10x error: `"2466500.0".replace('.', '')` → `"24665000"`
- Solution: Parse as `int(float(value))` - NO string replacement
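The difference between the two approaches can be reproduced directly. The function names `parse_buggy` and `parse_fixed` are illustrative and do not come from the scripts themselves.

```python
def parse_buggy(value):
    # Old approach: strip every period, assuming Vietnamese thousands separators.
    return int(value.replace(".", ""))

def parse_fixed(value):
    # Current approach: treat the period as a decimal point.
    return int(float(value))

assert parse_buggy("2.466.500") == 2466500    # fine on Vietnamese format
assert parse_buggy("2466500.0") == 24665000   # 10x too large on float format
assert parse_fixed("2466500.0") == 2466500
assert parse_fixed("2466500") == 2466500
assert parse_fixed("670.000") == 670          # Vietnamese input is silently misread
```

The last assertion shows why the CSV must not use a period as a thousands separator: `int(float(value))` reads `670.000` as 670 without raising an error.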
Pipeline Context
Your pickle replaces Step 1 of the 6-step pipeline:
| Step | Script | Your Role |
|---|---|---|
| 1 | parse_excel_file.py | ← YOU REPLACE THIS |
| 2 | label_parsed_excel_file_new.py | User runs |
| 3-6 | ... | User runs |
Pipeline location: /Users/sotola/PycharmProjects/mac_local_m4
Your boundary: Generate pickle → Stop. User handles rest.
Related Documentation
- Full onboarding: `/Users/sotola/PycharmProjects/db23/docs/onboarding-incomplete-data-processing.md`
- Pipeline guide: `/Users/sotola/PycharmProjects/db23/sops/six-step-ingestion-pipeline-doc.md`
- Detailed instructions: `/Users/sotola/PycharmProjects/db23/ai/generated_doc/generate-incomplete-day-pickle-instructions.md`