DB23 Excel Split Guide

Split multi-day A5 Excel workbooks into single-day files that pass through the 6-step data ingestion pipeline.

Prerequisites

The 6-step pipeline only accepts single-day Excel files. Multi-day files must be split first. This skill covers the prerequisite splitting and validation step.

Script Location

code

/Users/sotola/PycharmProjects/db23/ai/db23/split_excel_by_day.py

Quick Start

bash

cd /Users/sotola/PycharmProjects/db23 && \
python ai/db23/split_excel_by_day.py \
  --input "/home/ubuntu/Downloads/multi-day/YOUR_FILE.xlsx" \
  --output-dir "/home/ubuntu/Downloads"

Output: Files named 5 DDMM.xlsx (e.g., 5 2711.xlsx for Nov 27)

Arguments

Argument	Required	Description
`--input`	Yes	Path to multi-day Excel file
`--output-dir`	Yes	Directory for output files
`--template`	No	Template file to preserve column order
`--prefix`	No	Filename prefix (default: `5`)

Required Sheet Structure

Sheet1 (Regular Orders)

Sheet name must be: Sheet1

Required columns:

Column	Type	Description	Validation
`ORDER_DATE`	date	Trading date	Must be valid date, used for splitting
`SECURITYSYMBOL`	string	Stock symbol	Must be known stock (HPG, SSI, STB, etc.)
`CUSTOMERID`	string	Account number	Non-empty, becomes `stk` field
`FULLNAME`	string	Customer name	Can have missing values
`IDENTIFICATION`	string	Citizen ID (CCCD/CMND)	Can have missing values
`ADDRESS`	string	Address	Can have missing values
`ORDERSIDE`	string	Buy/Sell indicator	`B` or `S`
`SUM_O_VOL`	int	Order volume	Non-negative integer
`SUM_M_VOL`	int	Matched volume	Non-negative integer, ≤ SUM_O_VOL
`AVG_PRICE`	float	Average price	Positive number
`DEAL_VAL`	float	Deal value	Non-negative

Sheet2 (PT/Thỏa thuận)

Sheet name must be: Sheet2

Required columns:

Column	Type	Description	Validation
`PT_DATE`	date	PT trading date	Must be valid date
`SECURITYSYMBOL`	string	Stock symbol	Must match known stocks
`BUYERCUSTOMERID`	string	Buyer account	Non-empty
`FULLNAME`	string	Buyer name	Can have missing values
`IDENTIFICATION`	string	Buyer ID	Can have missing values
`ADDRESS`	string	Buyer address	Can have missing values
`SELLERCUSTOMERID`	string	Seller account	Non-empty
`FULLNAME1`	string	Seller name	Can have missing values
`IDENTIFICATION1`	string	Seller ID	Can have missing values
`ADDRESS1`	string	Seller address	Can have missing values
`M_VOL`	int	Matched volume	Positive integer

Note: Sheet2 must exist even if empty (no PT transactions for a day).

Validation Checklist

Before running the split script, validate the input file:

1. Check Sheet Names

python

import pandas as pd
xls = pd.ExcelFile('/path/to/file.xlsx')
print(f"Sheets: {xls.sheet_names}")
# Must output: ['Sheet1', 'Sheet2']

2. Check Sheet1 Columns

python

df1 = pd.read_excel('/path/to/file.xlsx', sheet_name='Sheet1')
required = ['ORDER_DATE', 'SECURITYSYMBOL', 'CUSTOMERID', 'FULLNAME',
            'IDENTIFICATION', 'ADDRESS', 'ORDERSIDE', 'SUM_O_VOL',
            'SUM_M_VOL', 'AVG_PRICE', 'DEAL_VAL']
missing = [c for c in required if c not in df1.columns]
print(f"Missing Sheet1 columns: {missing}")  # Should be empty

3. Check Sheet2 Columns

python

df2 = pd.read_excel('/path/to/file.xlsx', sheet_name='Sheet2')
required = ['PT_DATE', 'SECURITYSYMBOL', 'BUYERCUSTOMERID', 'FULLNAME',
            'IDENTIFICATION', 'ADDRESS', 'SELLERCUSTOMERID', 'FULLNAME1',
            'IDENTIFICATION1', 'ADDRESS1', 'M_VOL']
missing = [c for c in required if c not in df2.columns]
print(f"Missing Sheet2 columns: {missing}")  # Should be empty

4. Check Date Column Values

python

# Sheet1 dates
dates1 = pd.to_datetime(df1['ORDER_DATE'], errors='coerce')
print(f"Sheet1 date range: {dates1.min()} to {dates1.max()}")
print(f"Sheet1 invalid dates: {dates1.isna().sum()}")

# Sheet2 dates (if not empty)
if len(df2) > 0:
    dates2 = pd.to_datetime(df2['PT_DATE'], errors='coerce')
    print(f"Sheet2 date range: {dates2.min()} to {dates2.max()}")

5. Check Stock Symbols

python

KNOWN_STOCKS = {'HPG', 'SSI', 'STB', 'TCB', 'VPB', 'DXG', 'MBB', 'MSN',
                'MWG', 'VIC', 'VND', 'NVL', 'HDB', 'EIB', 'VRE', 'PDR',
                'LPB', 'VHM', 'CTG', 'FPT', 'VCB', 'HSG', 'VNM', 'ACB',
                'DIG', 'PVD', 'MSB', 'PNJ', 'FRT', 'GAS', 'KBC', 'DGC', 'SHB'}

stocks = set(df1['SECURITYSYMBOL'].dropna().unique())
unknown = stocks - KNOWN_STOCKS
print(f"Unknown stocks: {unknown}")  # Should be empty

6. Verify Output Files

After splitting, verify each output file:

python

import os
output_dir = '/home/ubuntu/Downloads'
for f in os.listdir(output_dir):
    if f.startswith('5 ') and f.endswith('.xlsx'):
        path = os.path.join(output_dir, f)
        xls = pd.ExcelFile(path)
        df1 = pd.read_excel(xls, sheet_name='Sheet1')
        df2 = pd.read_excel(xls, sheet_name='Sheet2')
        print(f"{f}: Sheet1={len(df1)} rows, Sheet2={len(df2)} rows")

Common Issues

Issue	Solution
Missing `ORDER_DATE` column	Check column names for typos or extra spaces
Sheet named `sheet1` (lowercase)	Rename to `Sheet1`
Multiple date formats	Standardize to `YYYY-MM-DD` or `DD/MM/YYYY`
Missing Sheet2	Create empty Sheet2 with required column headers
Unknown stock symbols	Map to known symbols or flag for review

Example Workflow

bash

# 1. List files in multi-day folder
ls /home/ubuntu/Downloads/multi-day/

# 2. Validate a file before splitting
cd /Users/sotola/PycharmProjects/db23 && \
python -c "
import pandas as pd
xls = pd.ExcelFile('/home/ubuntu/Downloads/multi-day/5 273110.xlsx')
print('Sheets:', xls.sheet_names)
df1 = pd.read_excel(xls, sheet_name='Sheet1')
print('Sheet1 columns:', list(df1.columns))
print('Date range:', pd.to_datetime(df1['ORDER_DATE']).min(), 'to', pd.to_datetime(df1['ORDER_DATE']).max())
"

# 3. Run the split
python ai/db23/split_excel_by_day.py \
  --input "/home/ubuntu/Downloads/multi-day/5 273110.xlsx" \
  --output-dir "/home/ubuntu/Downloads"

# 4. Verify output
ls -la /home/ubuntu/Downloads/5\ *.xlsx

Required: Summary Report

IMPORTANT: After splitting, always print a summary table to the user (do NOT write to file):

code

| File        | Date   | Sheet1 Rows | Sheet2 (PT) Rows |
|-------------|--------|-------------|------------------|
| 5 1612.xlsx | Dec 16 | 0           | 51               |
| 5 1712.xlsx | Dec 17 | 49,983      | 83               |
| 5 1812.xlsx | Dec 18 | 49,090      | 55               |

Generate this by reading each output file and formatting as markdown table.

After Splitting

Once files are split and validated, proceed to the 6-step pipeline with each single-day file. The pipeline scripts are in /Users/sotola/PycharmProjects/mac_local_m4/.