AgentSkillsCN

db23-pipeline

运行 6 步 db23 交易数据摄入流水线。适用于通过流水线处理单日 Excel 文件时使用,当用户提及“运行流水线”、“6 步”、“推送当日”、“标记当日”,或需要将新交易日摄入 df_trans 时使用。涵盖从 Excel 解析到远程同步的所有步骤。

SKILL.md
--- frontmatter
name: db23-pipeline
description: Run the 6-step db23 trading data ingestion pipeline. Use when processing single-day Excel files through the pipeline, when user mentions "run pipeline", "6-step", "push day", "label day", or needs to ingest a new trading day into df_trans. Covers all steps from parsing Excel to remote sync.

DB23 6-Step Pipeline

[Created by Sonnet: ea2be943-c1f3-4d1e-b9ae-c326b0eb08d7] [Edited by Opus: 8cdaee7c-f8aa-4e2b-b055-18cff1267dba]

Run the 6-step data ingestion pipeline for Vietnamese stock market trading data.

Critical Requirements

  • PYTHONHASHSEED=0 is mandatory for ALL steps
  • Sequential execution - run steps 1→6 in order, one day at a time
  • Single-day files only - use db23-excel-split skill first for multi-day files
  • Step 5 --day flag - MUST specify --day when re-processing existing days
  • BACKUP FIRST - Always backup existing data before running pipeline (see below)

Data Backup Warning

CRITICAL: The pipeline OVERWRITES existing data!

When you run the pipeline for a day that already has data in the output folders, the existing files WILL BE OVERWRITTEN without warning:

StepOutput LocationOverwritten?
Step 1/db23/parsed_excel_files/{day}.pickleYES
Step 2/db23/df_new_day_numerical/{day}.pickleYES
Step 2/db23/df_new_day_str/{day}.pickleYES

Before running the pipeline, ALWAYS backup existing data:

bash
# Create backup directory with your agent suffix
backup_dir="/tmp/agent-{suffix}-pickle-backup"
mkdir -p "$backup_dir/parsed_excel_files"
mkdir -p "$backup_dir/df_new_day_numerical"
mkdir -p "$backup_dir/df_new_day_str"

# Backup existing files for the days you're processing
day="2025_12_16"
cp /db23/parsed_excel_files/${day}.pickle "$backup_dir/parsed_excel_files/" 2>/dev/null
cp /db23/df_new_day_numerical/${day}.pickle "$backup_dir/df_new_day_numerical/" 2>/dev/null
cp /db23/df_new_day_str/${day}.pickle "$backup_dir/df_new_day_str/" 2>/dev/null

When to backup:

  • Processing days that already exist in the system
  • Re-running pipeline after an error
  • Any time you're not 100% certain the new data is correct and complete

Why this matters:

  • Old data may contain information not present in new data (see Incident 2026_01_06)
  • Without backup, you cannot recover from bad pipeline runs
  • Backups enable data merging when old + new are complementary

Agent Instructions for Multi-Day Processing

CRITICAL: When processing multiple days, agents MUST follow these rules:

  1. Run each day independently - Execute the full 6-step pipeline for one day, then move to the next. NEVER write a bash script or loop to process multiple days at once.

  2. Continue on error - If a day fails (any step), do NOT stop processing remaining days. Instead:

    • Use say "Pipeline error on YYYY MM DD" to alert the user
    • Log which step failed
    • Proceed to the next day immediately
  3. Report at the end - After all days are attempted, summarize:

    • Days that succeeded
    • Days that failed (and which step)

Example agent behavior for 3 days:

code
# Day 1: 2025_12_17
[Run steps 1-6 for 2025_12_17]
✓ Success → say "Day 2025 12 17 pipeline complete"

# Day 2: 2025_12_18
[Run steps 1-6 for 2025_12_18]
✗ Step 3 fails → say "Pipeline error on 2025 12 18"
→ Continue to next day (don't stop!)

# Day 3: 2025_12_19
[Run steps 1-6 for 2025_12_19]
✓ Success → say "Day 2025 12 19 pipeline complete"

# Final summary
Succeeded: 2025_12_17, 2025_12_19
Failed: 2025_12_18 (Step 3)

Why no multi-day scripts?

  • Each day's pickle depends on prior state
  • Errors are easier to diagnose day-by-day
  • User can intervene between days if needed

Pipeline Overview

StepScriptWhat it does
1parse_excel_file.pyParse Excel → /db23/parsed_excel_files/{day}.pickle
2label_parsed_excel_file_new.pyLabel, convert string→numerical, update s2i/i2s
3push_day_to_df_trans.pyAppend day to master df_trans_numerical.pickle
4push_s2i_and_latest_day_to_remote.pyUpload pickles to remote server (m3u)
5remote_compute_df_trans.pyTrigger remote merge via merge_to_df_trans.py
6get_latest_day_from_m3u_then_push_to_kame.pySync results to kame

Quick Start - Full Pipeline

bash
day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/parse_excel_file.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Quick Start - Steps 2-6 Only (Crafted Pickles)

For days with crafted pickles (no Excel file), skip Step 1:

bash
day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Individual Steps

Step 1: Parse Excel

bash
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/parse_excel_file.py --day YYYY_MM_DD

Input: /home/ubuntu/Downloads/5 DDMM.xlsx Output: /db23/parsed_excel_files/YYYY_MM_DD.pickle Time: ~5-6 seconds

Skip if: Using crafted pickle (Step 1 would overwrite it!)

Step 2: Label and Convert

bash
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/app/label_parsed_excel_file_new.py --day YYYY_MM_DD

Input: /db23/parsed_excel_files/YYYY_MM_DD.pickle Output: /db23/df_new_day_numerical/YYYY_MM_DD.pickle, /db23/df_new_day_str/YYYY_MM_DD.pickle Time: ~45-50 seconds

NaN check output (expected):

code
day 0 nan count
stock 0 nan count
stk 0 nan count
name 0 nan count
id 0 nan count
address 0 nan count
group 179134 nan count   <- EXPECTED (filled later)

Step 3: Push to df_trans

bash
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/push_day_to_df_trans.py --day YYYY_MM_DD

Input: /db23/df_new_day_numerical/YYYY_MM_DD.pickle Output: Updated /db23/pickles/df_trans_numerical.pickle Time: ~15-20 seconds

Success indicator: SUCCEEDED! (/db23/pickles/df_trans_numerical.pickle)

Step 4: Upload to Remote

bash
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/app/push_s2i_and_latest_day_to_remote.py --day YYYY_MM_DD

Uploads to: m3u (sotola@m3u:22222) Time: ~30-60 seconds (network dependent)

Step 5: Remote Compute + Verification

bash
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/remote_compute_df_trans.py --day YYYY_MM_DD

Remote script: merge_to_df_trans.py (replaces old known_good_version.py) Time: ~40-70 seconds (+ ~10s for verification)

CRITICAL: The --day flag is REQUIRED when re-processing existing days. Without it, the remote script auto-detects the "latest" day by filename, which may not be the day you're processing.

  • New day (not in remote df_trans): --day optional
  • Existing day (re-processing): --day REQUIRED

Integrated Verification: Step 5 now automatically verifies after merge:

  1. Local row count - reads /db23/pickles/df_trans_numerical.pickle
  2. Remote row count - reads remote df_trans via SSH
  3. NaN check - ensures no NaN in stk column (both local & remote)
  4. Match check - local and remote row counts must be equal

Verification Output:

code
=== VERIFICATION: 2025_12_17 ===

=== LOCAL VERIFICATION ===
Local row count for 2025_12_17: 50,149
✓ Local stk column: No NaN values

=== REMOTE VERIFICATION ===
Remote row count for 2025_12_17: 50,149
✓ Remote stk column: No NaN values

=== VERIFICATION SUMMARY ===
✓ Local: 50,149 rows, no NaN
✓ Remote: 50,149 rows, no NaN
✓ Row counts MATCH: 50,149

=== ALL VERIFICATIONS PASSED ===

If verification fails:

  • Local NaN in stk: Steps 1-4 were compromised - re-run from Step 1
  • Remote NaN in stk: Remote merge issue - re-run Step 5
  • Row count mismatch: Remote didn't process the uploaded day - check --day flag

Step 6: Sync to Kame

bash
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Syncs to: kame (kame@kamegoku:22)

Expected Timeouts

StepNormalRecommended Timeout
Step 16s10s
Step 250s70s
Step 320s30s
Step 460s90s
Step 570s100s
Step 630s60s
Full pipeline~4 min~6 min

Data Locations

PathContents
/home/ubuntu/Downloads/Input Excel files (5 DDMM.xlsx)
/db23/parsed_excel_files/Per-day parsed pickles
/db23/df_new_day_numerical/Daily numerical dataframes
/db23/df_new_day_str/Daily string dataframes
/db23/pickles/Master pickles (df_trans, s2i/i2s)

Common Mistakes

MistakeConsequencePrevention
Running Step 1 on crafted dayOverwrites crafted pickle with empty/wrong dataSkip Step 1 for crafted days
Missing PYTHONHASHSEED=0KeyError with large negative numberAlways include in every command
Running steps in parallelData corruptionRun sequentially
Processing multiple days at oncePickle conflictsOne day at a time
Step 5 without --day for existing dayRemote processes wrong day (latest by filename)Always pass --day for historical days

Communication Protocol

bash
# On success
say "Day YYYY MM DD pipeline complete"

# On failure
say "Pipeline error on YYYY MM DD"

Remote Servers

ServerSSHPurpose
m3usotola@m3u:22222Primary compute
kamekame@kamegoku:22Backup/sync

Script Locations

ScriptLocalRemote
merge_to_df_trans.py/Users/sotola/PycharmProjects/remote_to_m3_ultra/workflows/Same path
known_good_version.pySame directoryDeprecated - ignored --day flag

Incident Handling

This section documents resolved incidents and their solutions. When encountering similar issues, check here first.

Incident 2026_01_06: Complementary Data Overwrite (Day 2025_12_16)

Problem:

  • Old data for 2025_12_16 had 102,122 rows of regular orders only (is_pt=0)
  • New Excel file for 2025_12_16 had 51 rows of PT transactions only (Sheet1 empty, Sheet2 had 51 PT)
  • Running the pipeline overwrote old data with new, losing 102,122 regular orders

Symptoms:

  • After pipeline: Dec 16 had only 102 rows (51 PT × 2 for buyer+seller)
  • Before pipeline: Dec 16 had 102,122 rows (all non-PT)
  • Data was complementary, not replacement

Root Cause:

  • Multi-day Excel file was split, but the split only contained PT data for Dec 16
  • Step 1 overwrote the existing parsed pickle with PT-only data
  • No backup was made before running

Resolution:

  1. Restore old Step 1 output from backup: /tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle
  2. Concatenate old (102,122 non-PT) + new (102 PT) with matching columns:
    python
    import pandas as pd
    df_old = pd.read_pickle('/tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle')
    df_new = pd.read_pickle('/db23/parsed_excel_files/2025_12_16.pickle')
    # Reorder columns to match, cast price dtype
    df_new = df_new[df_old.columns]
    df_new['price'] = df_new['price'].astype(df_old['price'].dtype)
    df_combined = pd.concat([df_old, df_new], ignore_index=True)
    df_combined.to_pickle('/db23/parsed_excel_files/2025_12_16.pickle')
    
  3. Re-run Steps 2-6 for the day
  4. Verify: 102,224 rows (102,122 non-PT + 102 PT)

Prevention:

  • Always backup before running pipeline on existing days
  • Check new data source for completeness before processing
  • Compare row counts: if new << old, investigate before proceeding

Key Lesson: Old and new data can be complementary (each has data the other doesn't), not just replacement (new supersedes old). When in doubt, backup and compare before overwriting.