AgentSkillsCN

dataset-management

在创建数据集、上传文件、管理数据模式或配置数据集连接时使用。

SKILL.md
--- frontmatter
name: dataset-management
description: "Use when creating datasets, uploading files, managing schemas, or configuring dataset connections"

Dataset Management Patterns

Reference patterns for creating and managing Dataiku datasets via the Python API.

Dataset Types

TypeUse WhenCreation Method
ManagedOutput of recipes, stored in a connection (SQL, HDFS, etc.)project.new_managed_dataset(name)
UploadedImporting local files (CSV, Excel, etc.)project.create_dataset(name, "UploadedFiles", ...)
SQL TablePointing to an existing database tableproject.create_dataset(name, "Snowflake", ...)

Create a Managed Dataset

python
builder = project.new_managed_dataset("MY_OUTPUT")
builder.with_store_into("connection_name")
ds = builder.create()

# Configure table location (SQL databases)
settings = ds.get_settings()
raw = settings.get_raw()
raw["params"]["schema"] = "MY_SCHEMA"
raw["params"]["table"] = "MY_OUTPUT"
settings.save()

Upload a File

python
ds = project.create_dataset(
    "my_dataset", "UploadedFiles",
    params={"uploadConnection": "filesystem_managed"}
)
ds.uploaded_add_file("path/to/data.csv")

# Auto-detect schema from file contents
settings = ds.get_settings()
settings.autodetect_settings(infer_storage_types=True)
settings.save()

Common Column Types

Dataiku TypeDescription
stringText
int / bigintInteger / Large integer
double / floatDecimal numbers
booleanTrue/False
dateDate only

See references/column-types.md for the full type table.

Core Schema Operations

Get Schema

python
ds = project.get_dataset("my_dataset")
schema = ds.get_settings().get_schema()
for col in schema["columns"]:
    print(f"{col['name']}: {col['type']}")

Set Schema

python
settings = ds.get_settings()
settings.set_schema({"columns": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]})
settings.save()

Auto-detect Schema

python
dataset.autodetect_settings()
settings = dataset.get_settings()
settings.save()

See references/schema-operations.md for join compatibility checks, helper functions, and advanced operations.

SQL Schema Rule

Output datasets for SQL-based recipes MUST have schemas set before building. Without this, Dataiku generates CREATE TABLE () ... which fails.

For SQL databases (Snowflake, BigQuery), use UPPERCASE column names. Lowercase names get quoted, causing "invalid identifier" errors.

python
# Normalize column names to uppercase for SQL
raw = settings.get_raw()
for col in raw.get("schema", {}).get("columns", []):
    col["name"] = col["name"].upper()
settings.save()

List Datasets in Project

python
datasets = project.list_datasets()
for ds in datasets:
    print(f"- {ds['name']} ({ds.get('type', 'unknown')})")

Common Issues

IssueCauseSolution
Schema mismatchRecipe output doesn't matchRun autodetect_settings()
Join failsKey type mismatchCheck types, cast if needed
Missing columnsSchema not updatedRebuild dataset, update schema
Parse errorsWrong type detectionManually set schema

Detailed References