Hugging Face Dataset Viewer

Name: huggingface-datasets
Rating: 78
Author: huggingface

Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.

Core workflow

•Optionally validate dataset availability with /is-valid.
•Resolve config + split with /splits.
•Preview with /first-rows.
•Paginate content with /rows using offset and length (max 100).
•Use /search for text matching and /filter for row predicates.
•Retrieve parquet links via /parquet and totals/metadata via /size and /statistics.

Defaults

•Base URL: https://datasets-server.huggingface.co
•Default API method: GET
•Query params should be URL-encoded.
•offset is 0-based.
•length max is usually 100 for row-like endpoints.
•Gated/private datasets require Authorization: Bearer <HF_TOKEN>.

Dataset Viewer

•Validate dataset: /is-valid?dataset=<namespace/repo>
•List subsets and splits: /splits?dataset=<namespace/repo>
•Preview first rows: /first-rows?dataset=<namespace/repo>&config=<config>&split=<split>
•Paginate rows: /rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>
•Search text: /search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>
•Filter with predicates: /filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>
•List parquet shards: /parquet?dataset=<namespace/repo>
•Get size totals: /size?dataset=<namespace/repo>
•Get column statistics: /statistics?dataset=<namespace/repo>&config=<config>&split=<split>
•Get Croissant metadata (if available): /croissant?dataset=<namespace/repo>

Pagination pattern:

bash

curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"

When pagination is partial, use response fields such as num_rows_total, num_rows_per_page, and partial to drive continuation logic.

Search/filter notes:

•/search matches string columns (full-text style behavior is internal to the API).
•/filter requires predicate syntax in where and optional sort in orderby.
•Keep filtering and searches read-only and side-effect free.

Querying Datasets

Use npx parquetlens with Hub parquet alias paths for SQL querying.

Parquet alias shape:

text

hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet

Derive <config>, <split>, and <shard> from Dataset Viewer /parquet:

bash

curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'

Run SQL query:

bash

npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"

SQL export

•CSV: --sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"
•JSON: --sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"
•Parquet: --sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"

Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

•Create dataset repo in browser: https://huggingface.co/new-dataset
•Upload parquet files in the repo "Files and versions" page.
•Verify shards appear in Dataset Viewer:

bash

curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"

Low dependency CLI flow (npx @huggingface/hub / hfjs):

•Set auth token:

bash

export HF_TOKEN=<your_hf_token>

•Upload parquet folder to a dataset repo (auto-creates repo if missing):

bash

npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data

•Upload as private repo on creation:

bash

npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private

After upload, call /parquet to discover <config>/<split>/<shard> values for querying with @~parquet.