Instructions
1. Understand the Request
- •Identify the target datasets (names, IDs, or search queries).
- •Locate and read any provided format specification (e.g.,
unified_format.md). - •Clarify the scope: number of entries per dataset, output file name/location.
2. Search and Validate Datasets
- •Use
huggingface-dataset_searchto find each dataset. - •Use
huggingface-hub_repo_detailsto get metadata and confirm availability. - •Note the dataset size and structure.
3. Analyze Dataset Structures
- •Load a sample from each dataset using
datasets.load_dataset. - •Examine the column names and a few sample entries.
- •Identify the native format (e.g., ToolACE uses
system/conversations, Glaive usesconversations/tools, XLAM usesquery/answers/tools).
4. Convert to Unified Format
- •Always refer to the provided
unified_format.md(or equivalent) for the target schema. - •Key Rules:
- •
conversation_id: Format as{source_short_name}_{index}(e.g.,toolace_0). - •
messages: List of messages withrole(user/assistant/tool),content, and optionallytool_callsortool_call_id. - •
tool_calls: For assistant messages, includeid(formattool_call_{n}),name,arguments. - •
tools: List of normalized tool definitions. - •Remove system messages if they only contain tool instructions.
- •If assistant only made tool calls, set
contenttonull. - •Do not add tool return results or assistant replies not in the original data.
- •
- •Use the bundled
scripts/convert_and_merge.pyfor reliable, deterministic conversion of ToolACE, Glaive, and XLAM formats. - •For new dataset formats, write a custom converter following the patterns in the script.
5. Merge and Output
- •Limit entries per dataset as requested (default: first 500).
- •Merge all converted entries into a single list.
- •Write to a JSONL file in the workspace (default:
unified_tool_call.jsonl). - •Verify the output: count entries, check file size, validate format against the specification.
6. Finalize
- •Provide a summary: datasets found, entries converted, output location.
- •Optionally, show a sample entry from each source for validation.
- •Confirm the task is complete.