Hugging Face Dataset Viewer
Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.
Core workflow
- •Optionally validate dataset availability with
/is-valid. - •Resolve
config+splitwith/splits. - •Preview with
/first-rows. - •Paginate content with
/rowsusingoffsetandlength(max 100). - •Use
/searchfor text matching and/filterfor row predicates. - •Retrieve parquet links via
/parquetand totals/metadata via/sizeand/statistics.
Defaults
- •Base URL:
https://datasets-server.huggingface.co - •Default API method:
GET - •Query params should be URL-encoded.
- •
offsetis 0-based. - •
lengthmax is usually100for row-like endpoints. - •Gated/private datasets require
Authorization: Bearer <HF_TOKEN>.
Dataset Viewer
- •
Validate dataset:/is-valid?dataset=<namespace/repo> - •
List subsets and splits:/splits?dataset=<namespace/repo> - •
Preview first rows:/first-rows?dataset=<namespace/repo>&config=<config>&split=<split> - •
Paginate rows:/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int> - •
Search text:/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int> - •
Filter with predicates:/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int> - •
List parquet shards:/parquet?dataset=<namespace/repo> - •
Get size totals:/size?dataset=<namespace/repo> - •
Get column statistics:/statistics?dataset=<namespace/repo>&config=<config>&split=<split> - •
Get Croissant metadata (if available):/croissant?dataset=<namespace/repo>
Pagination pattern:
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100" curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
When pagination is partial, use response fields such as num_rows_total, num_rows_per_page, and partial to drive continuation logic.
Search/filter notes:
- •
/searchmatches string columns (full-text style behavior is internal to the API). - •
/filterrequires predicate syntax inwhereand optional sort inorderby. - •Keep filtering and searches read-only and side-effect free.
Querying Datasets
Use npx parquetlens with Hub parquet alias paths for SQL querying.
Parquet alias shape:
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
Derive <config>, <split>, and <shard> from Dataset Viewer /parquet:
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \ | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
Run SQL query:
npx -y -p parquetlens -p @parquetlens/sql parquetlens \ "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \ --sql "SELECT * FROM data LIMIT 20"
SQL export
- •CSV:
--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')" - •JSON:
--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)" - •Parquet:
--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"
Creating and Uploading Datasets
Use one of these flows depending on dependency constraints.
Zero local dependencies (Hub UI):
- •Create dataset repo in browser:
https://huggingface.co/new-dataset - •Upload parquet files in the repo "Files and versions" page.
- •Verify shards appear in Dataset Viewer:
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
Low dependency CLI flow (npx @huggingface/hub / hfjs):
- •Set auth token:
export HF_TOKEN=<your_hf_token>
- •Upload parquet folder to a dataset repo (auto-creates repo if missing):
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
- •Upload as private repo on creation:
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
After upload, call /parquet to discover <config>/<split>/<shard> values for querying with @~parquet.