Dataset Management Patterns

Reference patterns for creating and managing Dataiku datasets via the Python API.

Dataset Types

Type	Use When	Creation Method
Managed	Output of recipes, stored in a connection (SQL, HDFS, etc.)	`project.new_managed_dataset(name)`
Uploaded	Importing local files (CSV, Excel, etc.)	`project.create_dataset(name, "UploadedFiles", ...)`
SQL Table	Pointing to an existing database table	`project.create_dataset(name, "Snowflake", ...)`

Create a Managed Dataset

python

builder = project.new_managed_dataset("MY_OUTPUT")
builder.with_store_into("connection_name")
ds = builder.create()

# Configure table location (SQL databases)
settings = ds.get_settings()
raw = settings.get_raw()
raw["params"]["schema"] = "MY_SCHEMA"
raw["params"]["table"] = "MY_OUTPUT"
settings.save()

Upload a File

python

ds = project.create_dataset(
    "my_dataset", "UploadedFiles",
    params={"uploadConnection": "filesystem_managed"}
)
ds.uploaded_add_file("path/to/data.csv")

# Auto-detect schema from file contents
settings = ds.get_settings()
settings.autodetect_settings(infer_storage_types=True)
settings.save()

Common Column Types

Dataiku Type	Description
`string`	Text
`int` / `bigint`	Integer / Large integer
`double` / `float`	Decimal numbers
`boolean`	True/False
`date`	Date only

See references/column-types.md for the full type table.

Core Schema Operations

Get Schema

python

ds = project.get_dataset("my_dataset")
schema = ds.get_settings().get_schema()
for col in schema["columns"]:
    print(f"{col['name']}: {col['type']}")

Set Schema

python

settings = ds.get_settings()
settings.set_schema({"columns": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]})
settings.save()

Auto-detect Schema

python

dataset.autodetect_settings()
settings = dataset.get_settings()
settings.save()

See references/schema-operations.md for join compatibility checks, helper functions, and advanced operations.

SQL Schema Rule

Output datasets for SQL-based recipes MUST have schemas set before building. Without this, Dataiku generates CREATE TABLE () ... which fails.

For SQL databases (Snowflake, BigQuery), use UPPERCASE column names. Lowercase names get quoted, causing "invalid identifier" errors.

python

# Normalize column names to uppercase for SQL
raw = settings.get_raw()
for col in raw.get("schema", {}).get("columns", []):
    col["name"] = col["name"].upper()
settings.save()

List Datasets in Project

python

datasets = project.list_datasets()
for ds in datasets:
    print(f"- {ds['name']} ({ds.get('type', 'unknown')})")

Common Issues

Issue	Cause	Solution
Schema mismatch	Recipe output doesn't match	Run `autodetect_settings()`
Join fails	Key type mismatch	Check types, cast if needed
Missing columns	Schema not updated	Rebuild dataset, update schema
Parse errors	Wrong type detection	Manually set schema

Detailed References

•references/column-types.md — Full column type table with Python equivalents
•references/schema-operations.md — All schema operations, join compatibility checks, helper functions