DB23 6-Step Pipeline

[Created by Sonnet: ea2be943-c1f3-4d1e-b9ae-c326b0eb08d7] [Edited by Opus: 8cdaee7c-f8aa-4e2b-b055-18cff1267dba]

Run the 6-step data ingestion pipeline for Vietnamese stock market trading data.

Critical Requirements

•PYTHONHASHSEED=0 is mandatory for ALL steps
•Sequential execution - run steps 1→6 in order, one day at a time
•Single-day files only - use db23-excel-split skill first for multi-day files
•Step 5 --day flag - MUST specify --day when re-processing existing days
•BACKUP FIRST - Always backup existing data before running pipeline (see below)

Data Backup Warning

CRITICAL: The pipeline OVERWRITES existing data!

When you run the pipeline for a day that already has data in the output folders, the existing files WILL BE OVERWRITTEN without warning:

Step	Output Location	Overwritten?
Step 1	`/db23/parsed_excel_files/{day}.pickle`	YES
Step 2	`/db23/df_new_day_numerical/{day}.pickle`	YES
Step 2	`/db23/df_new_day_str/{day}.pickle`	YES

Before running the pipeline, ALWAYS backup existing data:

bash

# Create backup directory with your agent suffix
backup_dir="/tmp/agent-{suffix}-pickle-backup"
mkdir -p "$backup_dir/parsed_excel_files"
mkdir -p "$backup_dir/df_new_day_numerical"
mkdir -p "$backup_dir/df_new_day_str"

# Backup existing files for the days you're processing
day="2025_12_16"
cp /db23/parsed_excel_files/${day}.pickle "$backup_dir/parsed_excel_files/" 2>/dev/null
cp /db23/df_new_day_numerical/${day}.pickle "$backup_dir/df_new_day_numerical/" 2>/dev/null
cp /db23/df_new_day_str/${day}.pickle "$backup_dir/df_new_day_str/" 2>/dev/null

When to backup:

•Processing days that already exist in the system
•Re-running pipeline after an error
•Any time you're not 100% certain the new data is correct and complete

Why this matters:

•Old data may contain information not present in new data (see Incident 2026_01_06)
•Without backup, you cannot recover from bad pipeline runs
•Backups enable data merging when old + new are complementary

Agent Instructions for Multi-Day Processing

CRITICAL: When processing multiple days, agents MUST follow these rules:

•
Run each day independently - Execute the full 6-step pipeline for one day, then move to the next. NEVER write a bash script or loop to process multiple days at once.
•
Continue on error - If a day fails (any step), do NOT stop processing remaining days. Instead:
- •Use say "Pipeline error on YYYY MM DD" to alert the user
- •Log which step failed
- •Proceed to the next day immediately
•
Report at the end - After all days are attempted, summarize:
- •Days that succeeded
- •Days that failed (and which step)

Example agent behavior for 3 days:

code

# Day 1: 2025_12_17
[Run steps 1-6 for 2025_12_17]
✓ Success → say "Day 2025 12 17 pipeline complete"

# Day 2: 2025_12_18
[Run steps 1-6 for 2025_12_18]
✗ Step 3 fails → say "Pipeline error on 2025 12 18"
→ Continue to next day (don't stop!)

# Day 3: 2025_12_19
[Run steps 1-6 for 2025_12_19]
✓ Success → say "Day 2025 12 19 pipeline complete"

# Final summary
Succeeded: 2025_12_17, 2025_12_19
Failed: 2025_12_18 (Step 3)

Why no multi-day scripts?

•Each day's pickle depends on prior state
•Errors are easier to diagnose day-by-day
•User can intervene between days if needed

Pipeline Overview

Step	Script	What it does
1	`parse_excel_file.py`	Parse Excel → `/db23/parsed_excel_files/{day}.pickle`
2	`label_parsed_excel_file_new.py`	Label, convert string→numerical, update s2i/i2s
3	`push_day_to_df_trans.py`	Append day to master `df_trans_numerical.pickle`
4	`push_s2i_and_latest_day_to_remote.py`	Upload pickles to remote server (m3u)
5	`remote_compute_df_trans.py`	Trigger remote merge via `merge_to_df_trans.py`
6	`get_latest_day_from_m3u_then_push_to_kame.py`	Sync results to kame

Quick Start - Full Pipeline

bash

day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/parse_excel_file.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Quick Start - Steps 2-6 Only (Crafted Pickles)

For days with crafted pickles (no Excel file), skip Step 1:

bash

day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Individual Steps

Step 1: Parse Excel

bash

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/parse_excel_file.py --day YYYY_MM_DD

Input: /home/ubuntu/Downloads/5 DDMM.xlsx Output: /db23/parsed_excel_files/YYYY_MM_DD.pickle Time: ~5-6 seconds

Skip if: Using crafted pickle (Step 1 would overwrite it!)

Step 2: Label and Convert

bash

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/app/label_parsed_excel_file_new.py --day YYYY_MM_DD

Input: /db23/parsed_excel_files/YYYY_MM_DD.pickle Output: /db23/df_new_day_numerical/YYYY_MM_DD.pickle, /db23/df_new_day_str/YYYY_MM_DD.pickle Time: ~45-50 seconds

NaN check output (expected):

code

day 0 nan count
stock 0 nan count
stk 0 nan count
name 0 nan count
id 0 nan count
address 0 nan count
group 179134 nan count   <- EXPECTED (filled later)

Step 3: Push to df_trans

bash

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/push_day_to_df_trans.py --day YYYY_MM_DD

Input: /db23/df_new_day_numerical/YYYY_MM_DD.pickle Output: Updated /db23/pickles/df_trans_numerical.pickle Time: ~15-20 seconds

Success indicator: SUCCEEDED! (/db23/pickles/df_trans_numerical.pickle)

Step 4: Upload to Remote

bash

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/app/push_s2i_and_latest_day_to_remote.py --day YYYY_MM_DD

Uploads to: m3u (sotola@m3u:22222) Time: ~30-60 seconds (network dependent)

Step 5: Remote Compute + Verification

bash

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/remote_compute_df_trans.py --day YYYY_MM_DD

Remote script: merge_to_df_trans.py (replaces old known_good_version.py) Time: ~40-70 seconds (+ ~10s for verification)

CRITICAL: The --day flag is REQUIRED when re-processing existing days. Without it, the remote script auto-detects the "latest" day by filename, which may not be the day you're processing.

•New day (not in remote df_trans): --day optional
•Existing day (re-processing): --day REQUIRED

Integrated Verification: Step 5 now automatically verifies after merge:

•Local row count - reads /db23/pickles/df_trans_numerical.pickle
•Remote row count - reads remote df_trans via SSH
•NaN check - ensures no NaN in stk column (both local & remote)
•Match check - local and remote row counts must be equal

Verification Output:

code

=== VERIFICATION: 2025_12_17 ===

=== LOCAL VERIFICATION ===
Local row count for 2025_12_17: 50,149
✓ Local stk column: No NaN values

=== REMOTE VERIFICATION ===
Remote row count for 2025_12_17: 50,149
✓ Remote stk column: No NaN values

=== VERIFICATION SUMMARY ===
✓ Local: 50,149 rows, no NaN
✓ Remote: 50,149 rows, no NaN
✓ Row counts MATCH: 50,149

=== ALL VERIFICATIONS PASSED ===

If verification fails:

•Local NaN in stk: Steps 1-4 were compromised - re-run from Step 1
•Remote NaN in stk: Remote merge issue - re-run Step 5
•Row count mismatch: Remote didn't process the uploaded day - check --day flag

Step 6: Sync to Kame

bash

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Syncs to: kame (kame@kamegoku:22)

Expected Timeouts

Step	Normal	Recommended Timeout
Step 1	6s	10s
Step 2	50s	70s
Step 3	20s	30s
Step 4	60s	90s
Step 5	70s	100s
Step 6	30s	60s
Full pipeline	~4 min	~6 min

Data Locations

Path	Contents
`/home/ubuntu/Downloads/`	Input Excel files (`5 DDMM.xlsx`)
`/db23/parsed_excel_files/`	Per-day parsed pickles
`/db23/df_new_day_numerical/`	Daily numerical dataframes
`/db23/df_new_day_str/`	Daily string dataframes
`/db23/pickles/`	Master pickles (df_trans, s2i/i2s)

Common Mistakes

Mistake	Consequence	Prevention
Running Step 1 on crafted day	Overwrites crafted pickle with empty/wrong data	Skip Step 1 for crafted days
Missing PYTHONHASHSEED=0	KeyError with large negative number	Always include in every command
Running steps in parallel	Data corruption	Run sequentially
Processing multiple days at once	Pickle conflicts	One day at a time
Step 5 without --day for existing day	Remote processes wrong day (latest by filename)	Always pass --day for historical days

Communication Protocol

bash

# On success
say "Day YYYY MM DD pipeline complete"

# On failure
say "Pipeline error on YYYY MM DD"

Remote Servers

Server	SSH	Purpose
m3u	`sotola@m3u:22222`	Primary compute
kame	`kame@kamegoku:22`	Backup/sync

Script Locations

Script	Local	Remote
`merge_to_df_trans.py`	`/Users/sotola/PycharmProjects/remote_to_m3_ultra/workflows/`	Same path
`known_good_version.py`	Same directory	Deprecated - ignored --day flag

Incident Handling

This section documents resolved incidents and their solutions. When encountering similar issues, check here first.

Incident 2026_01_06: Complementary Data Overwrite (Day 2025_12_16)

Problem:

•Old data for 2025_12_16 had 102,122 rows of regular orders only (is_pt=0)
•New Excel file for 2025_12_16 had 51 rows of PT transactions only (Sheet1 empty, Sheet2 had 51 PT)
•Running the pipeline overwrote old data with new, losing 102,122 regular orders

Symptoms:

•After pipeline: Dec 16 had only 102 rows (51 PT × 2 for buyer+seller)
•Before pipeline: Dec 16 had 102,122 rows (all non-PT)
•Data was complementary, not replacement

Root Cause:

•Multi-day Excel file was split, but the split only contained PT data for Dec 16
•Step 1 overwrote the existing parsed pickle with PT-only data
•No backup was made before running

Resolution:

•Restore old Step 1 output from backup: /tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle

•Concatenate old (102,122 non-PT) + new (102 PT) with matching columns:

python

import pandas as pd
df_old = pd.read_pickle('/tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle')
df_new = pd.read_pickle('/db23/parsed_excel_files/2025_12_16.pickle')
# Reorder columns to match, cast price dtype
df_new = df_new[df_old.columns]
df_new['price'] = df_new['price'].astype(df_old['price'].dtype)
df_combined = pd.concat([df_old, df_new], ignore_index=True)
df_combined.to_pickle('/db23/parsed_excel_files/2025_12_16.pickle')

•Re-run Steps 2-6 for the day
•Verify: 102,224 rows (102,122 non-PT + 102 PT)

Prevention:

•Always backup before running pipeline on existing days
•Check new data source for completeness before processing
•Compare row counts: if new << old, investigate before proceeding

Key Lesson: Old and new data can be complementary (each has data the other doesn't), not just replacement (new supersedes old). When in doubt, backup and compare before overwriting.