dataset-prep

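Provides utilities for converting raw interaction data to pyKT format and validating dataset integrity. Use this skill when preparing a new dataset for Knowledge Tracing experiments.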

SKILL.md
---
name: dataset-prep
description: Utilities for converting raw interaction data to pyKT format and validating dataset integrity. Use when preparing new datasets for Knowledge Tracing experiments.
---

pyKT Dataset Preparation

Prepare and validate datasets for Knowledge Tracing experiments with the pyKT framework.

Quick Start

Check Preprocessing Status

```bash
python scripts/preprocess.py --status --data-dir ../data
```

Preprocess a Dataset

```bash
# Single dataset
python scripts/preprocess.py --dataset assist2009 --pykt-path ../pykt-toolkit

# Multiple datasets
python scripts/preprocess.py --dataset assist2009 assist2015 ednet

# All available datasets
python scripts/preprocess.py --all
```

Validate Dataset

```bash
# Validate entire dataset directory
python scripts/validate_dataset.py --dir ../data/assist2009

# Validate specific file
python scripts/validate_dataset.py --file ../data/assist2009/data.txt --type raw
python scripts/validate_dataset.py --file ../data/assist2009/train_valid_sequences.csv --type csv
```
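To give a feel for what a `--type raw` check might involve, here is a minimal sketch. It assumes a six-line-per-student layout (a `uid,length` header followed by questions, concepts, responses, timestamps, and use times, with `NA` marking a missing field). That layout, the `NA` marker, and the function name `check_raw_records` are assumptions for illustration; this is not the actual logic of validate_dataset.py.

```python
from typing import List

# Assumed six-line record layout (not confirmed by this document):
# "uid,length" header, then questions, concepts, responses, timestamps, usetimes.
FIELD_NAMES = ["questions", "concepts", "responses", "timestamps", "usetimes"]
LINES_PER_RECORD = 1 + len(FIELD_NAMES)


def check_raw_records(lines: List[str]) -> List[str]:
    """Return human-readable problems found in raw-format lines (empty if none)."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    problems = []
    if len(rows) % LINES_PER_RECORD != 0:
        problems.append(
            f"{len(rows)} non-empty lines is not a multiple of {LINES_PER_RECORD}"
        )
    # Only complete records are inspected; a trailing partial record is reported above.
    starts = range(0, len(rows) - LINES_PER_RECORD + 1, LINES_PER_RECORD)
    for rec, start in enumerate(starts, 1):
        header, *fields = rows[start:start + LINES_PER_RECORD]
        parts = header.split(",")
        if len(parts) != 2 or not parts[1].isdigit():
            problems.append(f"record {rec}: malformed header {header!r}")
            continue
        expected = int(parts[1])
        for name, field in zip(FIELD_NAMES, fields):
            if field == "NA":  # assumed marker for a field the dataset lacks
                continue
            values = field.split(",")
            if len(values) != expected:
                problems.append(
                    f"record {rec}: {name} has {len(values)} values, expected {expected}"
                )
            if name == "responses" and any(v not in ("0", "1") for v in values):
                problems.append(f"record {rec}: responses must be 0 or 1")
    return problems
```

Calling `check_raw_records(open(path).read().splitlines())` on a file returns an empty list when every record is consistent under these assumptions.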

Scripts

preprocess.py

Wrapper for pyKT's preprocessing pipeline with batch support.

| Option | Description |
|---|---|
| `--dataset NAME` | Dataset(s) to preprocess |
| `--all` | Process all available datasets |
| `--status` | Show preprocessing status only |
| `--data-dir PATH` | Data directory location |
| `--pykt-path PATH` | pykt-toolkit installation path |
| `--min-seq-len N` | Minimum sequence length (default: 3) |
| `--maxlen N` | Maximum sequence length (default: 200) |
| `--kfold N` | Number of CV folds (default: 5) |
| `--list` | List supported datasets with download URLs |
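The three numeric options interact in a way that is easier to see in code. The sketch below shows one plausible reading: a student's sequence is cut into chunks of at most `maxlen`, chunks shorter than `min_seq_len` are dropped, short chunks are padded, and students are spread across `kfold` folds. The padding value `-1`, the seed, and the helper names `split_sequence` and `assign_folds` are assumptions for illustration, not the wrapper's actual implementation.

```python
import random
from typing import Dict, List

PAD = -1  # assumed padding value for positions past the end of a chunk


def split_sequence(
    responses: List[int], min_seq_len: int = 3, maxlen: int = 200
) -> List[List[int]]:
    """Cut one student's sequence into fixed-length, padded chunks."""
    chunks = []
    for start in range(0, len(responses), maxlen):
        chunk = responses[start:start + maxlen]
        if len(chunk) < min_seq_len:
            continue  # too short to be useful for training
        chunks.append(chunk + [PAD] * (maxlen - len(chunk)))
    return chunks


def assign_folds(uids: List[str], kfold: int = 5, seed: int = 42) -> Dict[str, int]:
    """Shuffle student ids and assign them round-robin to folds 0..kfold-1."""
    shuffled = list(uids)
    random.Random(seed).shuffle(shuffled)
    return {uid: i % kfold for i, uid in enumerate(shuffled)}
```

Splitting by student rather than by interaction keeps all of one student's chunks in the same fold, which avoids leaking a student's history between training and validation data.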

validate_dataset.py

Validate data format and report statistics.

| Option | Description |
|---|---|
| `--file PATH` | Single file to validate |
| `--dir PATH` | Dataset directory to validate |
| `--type raw/csv` | File type (auto-detected) |
| `--json` | Output results as JSON |
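As a rough illustration of how type auto-detection and `--json` output could fit together, the sketch below guesses the type from the file extension and serializes a small report. The extension rule, the report keys, and the helper names `detect_type` and `build_report` are all hypothetical; the real script may detect types and structure its JSON differently.

```python
import json
from pathlib import Path
from typing import List


def detect_type(path: str) -> str:
    """Guess the file type from its extension (assumed rule: .csv is csv, else raw)."""
    return "csv" if Path(path).suffix.lower() == ".csv" else "raw"


def build_report(path: str, problems: List[str]) -> str:
    """Serialize a validation result as JSON; the keys here are hypothetical."""
    report = {
        "file": path,
        "type": detect_type(path),
        "valid": not problems,
        "problems": problems,
    }
    return json.dumps(report, indent=2)
```

A machine-readable report like this is convenient when validation runs inside a larger script that needs to branch on the result.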

Supported Datasets

| Dataset | Type | Description |
|---|---|---|
| assist2009 | Q+C | ASSISTments 2009-2010 Math |
| assist2012 | Q+C | ASSISTments 2012-2013 |
| assist2015 | C | ASSISTments Skill Builder |
| assist2017 | Q+C | ASSISTments Competition |
| algebra2005 | Q+C | KDD Cup Algebra |
| bridge2algebra2006 | Q+C | KDD Cup Bridge to Algebra |
| statics2011 | C | Andes Physics |
| nips_task34 | Q+C | Eedi Education Challenge |
| ednet | Q+C | TOEIC English (Riiid) |
| junyi2015 | Q+C | Junyi Academy K-12 Math |
| slepemapy | Q+C | Geography |
| poj | C | Programming Judge |

Type: Q+C = Questions + Concepts, C = Concepts only

Workflow

  1. Download the raw data from its source (run preprocess.py with --list for URLs)
  2. Place it in pykt-toolkit/data/{dataset_name}/
  3. Preprocess with preprocess.py
  4. Validate with validate_dataset.py
  5. Train models using pyKT
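Steps 3 and 4 of the workflow can be chained from a small driver script. The sketch below only uses flags listed in the option tables above; combining `--dataset` with `--data-dir` in one call, the `../data` default, and the helper names `build_commands` and `run_pipeline` are assumptions for illustration.

```python
import subprocess
import sys
from typing import List


def build_commands(dataset: str, data_dir: str = "../data") -> List[List[str]]:
    """Build the preprocess and validate commands for one dataset."""
    return [
        [sys.executable, "scripts/preprocess.py",
         "--dataset", dataset, "--data-dir", data_dir],
        [sys.executable, "scripts/validate_dataset.py",
         "--dir", f"{data_dir}/{dataset}"],
    ]


def run_pipeline(dataset: str, data_dir: str = "../data") -> None:
    """Run each step in order, stopping at the first non-zero exit code."""
    for cmd in build_commands(dataset, data_dir):
        subprocess.run(cmd, check=True)
```

With `check=True`, a failed preprocessing step raises `subprocess.CalledProcessError`, so validation never runs against incomplete output.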