AgentSkillsCN

db23-excel-split

将多日 A5 Excel 文件拆分为单日文件,用于 db23 交易数据流水线。适用于处理 /home/ubuntu/Downloads/多日文件中包含多个交易日的 Excel 文件时,当用户提及“拆分 Excel”、“多日 Excel”、“A5 格式”,或需要为 6 步流水线准备 Excel 文件时使用。该技能涵盖工作表名称验证、列结构检查以及数据格式要求。

SKILL.md
--- frontmatter
name: db23-excel-split
description: Split multi-day A5 Excel files into single-day files for the db23 trading data pipeline. Use when processing Excel files in /home/ubuntu/Downloads/multi-day that contain multiple trading days, when the user mentions "split excel", "multi-day excel", "A5 format", or needs to prepare Excel files for the 6-step pipeline. This skill covers validation of sheet names, column structure, and data format requirements.

DB23 Excel Split Guide

Split multi-day A5 Excel workbooks into single-day files that pass through the 6-step data ingestion pipeline.

Prerequisites

The 6-step pipeline only accepts single-day Excel files. Multi-day files must be split first. This skill covers the prerequisite splitting and validation step.

Script Location

code
/Users/sotola/PycharmProjects/db23/ai/db23/split_excel_by_day.py

Quick Start

bash
cd /Users/sotola/PycharmProjects/db23 && \
python ai/db23/split_excel_by_day.py \
  --input "/home/ubuntu/Downloads/multi-day/YOUR_FILE.xlsx" \
  --output-dir "/home/ubuntu/Downloads"

Output: Files named 5 DDMM.xlsx (e.g., 5 2711.xlsx for Nov 27)

Arguments

ArgumentRequiredDescription
--inputYesPath to multi-day Excel file
--output-dirYesDirectory for output files
--templateNoTemplate file to preserve column order
--prefixNoFilename prefix (default: 5)

Required Sheet Structure

Sheet1 (Regular Orders)

Sheet name must be: Sheet1

Required columns:

ColumnTypeDescriptionValidation
ORDER_DATEdateTrading dateMust be valid date, used for splitting
SECURITYSYMBOLstringStock symbolMust be known stock (HPG, SSI, STB, etc.)
CUSTOMERIDstringAccount numberNon-empty, becomes stk field
FULLNAMEstringCustomer nameCan have missing values
IDENTIFICATIONstringCitizen ID (CCCD/CMND)Can have missing values
ADDRESSstringAddressCan have missing values
ORDERSIDEstringBuy/Sell indicatorB or S
SUM_O_VOLintOrder volumeNon-negative integer
SUM_M_VOLintMatched volumeNon-negative integer, ≤ SUM_O_VOL
AVG_PRICEfloatAverage pricePositive number
DEAL_VALfloatDeal valueNon-negative

Sheet2 (PT/Thỏa thuận)

Sheet name must be: Sheet2

Required columns:

ColumnTypeDescriptionValidation
PT_DATEdatePT trading dateMust be valid date
SECURITYSYMBOLstringStock symbolMust match known stocks
BUYERCUSTOMERIDstringBuyer accountNon-empty
FULLNAMEstringBuyer nameCan have missing values
IDENTIFICATIONstringBuyer IDCan have missing values
ADDRESSstringBuyer addressCan have missing values
SELLERCUSTOMERIDstringSeller accountNon-empty
FULLNAME1stringSeller nameCan have missing values
IDENTIFICATION1stringSeller IDCan have missing values
ADDRESS1stringSeller addressCan have missing values
M_VOLintMatched volumePositive integer

Note: Sheet2 must exist even if empty (no PT transactions for a day).

Validation Checklist

Before running the split script, validate the input file:

1. Check Sheet Names

python
import pandas as pd
xls = pd.ExcelFile('/path/to/file.xlsx')
print(f"Sheets: {xls.sheet_names}")
# Must output: ['Sheet1', 'Sheet2']

2. Check Sheet1 Columns

python
df1 = pd.read_excel('/path/to/file.xlsx', sheet_name='Sheet1')
required = ['ORDER_DATE', 'SECURITYSYMBOL', 'CUSTOMERID', 'FULLNAME',
            'IDENTIFICATION', 'ADDRESS', 'ORDERSIDE', 'SUM_O_VOL',
            'SUM_M_VOL', 'AVG_PRICE', 'DEAL_VAL']
missing = [c for c in required if c not in df1.columns]
print(f"Missing Sheet1 columns: {missing}")  # Should be empty

3. Check Sheet2 Columns

python
df2 = pd.read_excel('/path/to/file.xlsx', sheet_name='Sheet2')
required = ['PT_DATE', 'SECURITYSYMBOL', 'BUYERCUSTOMERID', 'FULLNAME',
            'IDENTIFICATION', 'ADDRESS', 'SELLERCUSTOMERID', 'FULLNAME1',
            'IDENTIFICATION1', 'ADDRESS1', 'M_VOL']
missing = [c for c in required if c not in df2.columns]
print(f"Missing Sheet2 columns: {missing}")  # Should be empty

4. Check Date Column Values

python
# Sheet1 dates
dates1 = pd.to_datetime(df1['ORDER_DATE'], errors='coerce')
print(f"Sheet1 date range: {dates1.min()} to {dates1.max()}")
print(f"Sheet1 invalid dates: {dates1.isna().sum()}")

# Sheet2 dates (if not empty)
if len(df2) > 0:
    dates2 = pd.to_datetime(df2['PT_DATE'], errors='coerce')
    print(f"Sheet2 date range: {dates2.min()} to {dates2.max()}")

5. Check Stock Symbols

python
KNOWN_STOCKS = {'HPG', 'SSI', 'STB', 'TCB', 'VPB', 'DXG', 'MBB', 'MSN',
                'MWG', 'VIC', 'VND', 'NVL', 'HDB', 'EIB', 'VRE', 'PDR',
                'LPB', 'VHM', 'CTG', 'FPT', 'VCB', 'HSG', 'VNM', 'ACB',
                'DIG', 'PVD', 'MSB', 'PNJ', 'FRT', 'GAS', 'KBC', 'DGC', 'SHB'}

stocks = set(df1['SECURITYSYMBOL'].dropna().unique())
unknown = stocks - KNOWN_STOCKS
print(f"Unknown stocks: {unknown}")  # Should be empty

6. Verify Output Files

After splitting, verify each output file:

python
import os
output_dir = '/home/ubuntu/Downloads'
for f in os.listdir(output_dir):
    if f.startswith('5 ') and f.endswith('.xlsx'):
        path = os.path.join(output_dir, f)
        xls = pd.ExcelFile(path)
        df1 = pd.read_excel(xls, sheet_name='Sheet1')
        df2 = pd.read_excel(xls, sheet_name='Sheet2')
        print(f"{f}: Sheet1={len(df1)} rows, Sheet2={len(df2)} rows")

Common Issues

IssueSolution
Missing ORDER_DATE columnCheck column names for typos or extra spaces
Sheet named sheet1 (lowercase)Rename to Sheet1
Multiple date formatsStandardize to YYYY-MM-DD or DD/MM/YYYY
Missing Sheet2Create empty Sheet2 with required column headers
Unknown stock symbolsMap to known symbols or flag for review

Example Workflow

bash
# 1. List files in multi-day folder
ls /home/ubuntu/Downloads/multi-day/

# 2. Validate a file before splitting
cd /Users/sotola/PycharmProjects/db23 && \
python -c "
import pandas as pd
xls = pd.ExcelFile('/home/ubuntu/Downloads/multi-day/5 273110.xlsx')
print('Sheets:', xls.sheet_names)
df1 = pd.read_excel(xls, sheet_name='Sheet1')
print('Sheet1 columns:', list(df1.columns))
print('Date range:', pd.to_datetime(df1['ORDER_DATE']).min(), 'to', pd.to_datetime(df1['ORDER_DATE']).max())
"

# 3. Run the split
python ai/db23/split_excel_by_day.py \
  --input "/home/ubuntu/Downloads/multi-day/5 273110.xlsx" \
  --output-dir "/home/ubuntu/Downloads"

# 4. Verify output
ls -la /home/ubuntu/Downloads/5\ *.xlsx

Required: Summary Report

IMPORTANT: After splitting, always print a summary table to the user (do NOT write to file):

code
| File        | Date   | Sheet1 Rows | Sheet2 (PT) Rows |
|-------------|--------|-------------|------------------|
| 5 1612.xlsx | Dec 16 | 0           | 51               |
| 5 1712.xlsx | Dec 17 | 49,983      | 83               |
| 5 1812.xlsx | Dec 18 | 49,090      | 55               |

Generate this by reading each output file and formatting as markdown table.

After Splitting

Once files are split and validated, proceed to the 6-step pipeline with each single-day file. The pipeline scripts are in /Users/sotola/PycharmProjects/mac_local_m4/.