Instructions

Name: dataset-license-resolver
Rating: 92
Author: zjunlp

1. Understand the Request

For each dataset/model mentioned:

•GitHub Repositories:
  - Use github-get_file_contents to examine repository structure and README
  - Look for license files (LICENSE, LICENSE.md, etc.) and documentation
•Hugging Face Resources:
  - Use huggingface-hub_repo_details to get metadata including license tags
  - Use fetch-fetch_markdown to retrieve full README/content
  - Search for related datasets/models using huggingface-dataset_search or huggingface-model_search
•Trace Provenance:
  - Follow references to original sources (e.g., "adopted from HuggingFaceTB team")
  - Identify base models used for synthesis (e.g., "uses DeepSeek-V2.5")
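
Once the tools above have returned a file listing and a README, the license lookup itself can be sketched in plain Python. This is a minimal sketch, not the skill's actual implementation: the helper names `find_license_files` and `license_from_card` are hypothetical, the file-name pattern is an assumption about common naming, and the card parser only handles a single-line `license:` value in the YAML front matter that Hugging Face cards place at the top of the README.

```python
import re

# Assumption: license text usually lives in a root file with one of these names.
LICENSE_FILE_PATTERN = re.compile(
    r"^(LICEN[CS]E|COPYING)(\.(md|txt|rst))?$", re.IGNORECASE
)


def find_license_files(file_names):
    """Return the entries of a repository root listing that look like license files."""
    return [name for name in file_names if LICENSE_FILE_PATTERN.match(name)]


def license_from_card(readme_text):
    """Extract a single-line `license:` value from a card's YAML front matter.

    Returns None when there is no front matter or no single-line license
    field (for example, when the license is given as a YAML list).
    """
    front_matter = re.match(r"^---\s*\n(.*?)\n---\s*(\n|$)", readme_text, re.DOTALL)
    if front_matter is None:
        return None
    field = re.search(
        r"^license:[ \t]*(\S.*?)[ \t]*$", front_matter.group(1), re.MULTILINE
    )
    if field is None:
        return None
    # Drop optional YAML quoting around the value.
    return field.group(1).strip("'\"")
```

A `None` result from `license_from_card` is a signal to fall back to the README body or to the provenance chain rather than to guess.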

When multiple sources exist:

•Identify direct sources: For raw/derived datasets, find the original data source
•Identify synthesis models: For processed/transformed datasets, find the models used for generation
•Compare licenses: Evaluate which is more permissive for derivative/secondary use based on:
  - Attribution requirements
  - Use-based restrictions
  - Commercial use allowances
  - Redistribution terms
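
One way to make the comparison above explicit is to record the four criteria per license and count restrictions. The sketch below is illustrative only: `LicenseTerms`, `TERMS`, and `more_permissive` are hypothetical names, the equal weighting of the four criteria is an assumption, and the table entries are simplified summaries that should be checked against the actual license text. Custom licenses would need an entry filled in from their own terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    attribution_required: bool
    use_based_restrictions: bool
    commercial_use_allowed: bool
    redistribution_allowed: bool

    def restriction_count(self):
        """Count how many of the four criteria restrict derivative use."""
        return sum([
            self.attribution_required,
            self.use_based_restrictions,
            not self.commercial_use_allowed,
            not self.redistribution_allowed,
        ])


# Simplified, illustrative summaries (assumption: verify against the license text).
TERMS = {
    "MIT": LicenseTerms(True, False, True, True),
    "CC-BY-4.0": LicenseTerms(True, False, True, True),
    "CC-BY-NC-4.0": LicenseTerms(True, False, False, True),
}


def more_permissive(first, second, terms=TERMS):
    """Return the license with fewer restrictions; ties go to `first`."""
    if terms[second].restriction_count() < terms[first].restriction_count():
        return second
    return first
```

A plain count treats every criterion as equally important. If the user cares most about one criterion, such as commercial use, that criterion should be compared first and the count used only as a tie-breaker.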

Follow the user's priority rules:

•If the user specifies "most permissive for derivative/secondary use," compare the candidate licenses and select the one with fewer restrictions
•If the user specifies "directly reuse license from source," use the identified source's license
•Document the reasoning for license selection
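
The two priority rules above can be sketched as a small dispatch function that also returns the reasoning to document. This is a sketch under stated assumptions: `select_license` and its `restriction_counts` argument (a mapping from license name to a number of restrictions, however the comparison step produced it) are hypothetical, and the rule strings are matched exactly as quoted above.

```python
def select_license(rule, source_license, synthesis_license, restriction_counts):
    """Apply the user's priority rule and return (license, reasoning)."""
    if rule == "directly reuse license from source":
        return (
            source_license,
            f"Reused the source license ({source_license}) as requested.",
        )
    if rule == "most permissive for derivative/secondary use":
        candidates = [source_license, synthesis_license]
        unknown = [name for name in candidates if name not in restriction_counts]
        if unknown:
            # Unknown terms should lead to a clarification request, not a guess.
            raise ValueError(f"No restriction data for: {', '.join(unknown)}")
        # min() keeps the first candidate on ties, so the source license wins a tie.
        chosen = min(candidates, key=restriction_counts.__getitem__)
        return (
            chosen,
            f"Selected {chosen} as the less restrictive of "
            f"{source_license} and {synthesis_license}.",
        )
    raise ValueError(f"Unrecognized priority rule: {rule!r}")
```

Returning the reasoning alongside the license keeps the selection and its justification together, so the justification can be included in the reply or in the README.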

Based on the user's requirements:

•Reply to Issues:
  - Format the response exactly as specified by the user
  - Include all required placeholders, filled with the determined licenses
•Update Documentation:
  - Retrieve the current README/content using the appropriate tools
  - Append the license section at the end as specified
  - Maintain existing formatting and content
•Authentication:
  - Use the provided tokens (e.g., read the .hf_token file)
  - Pass tokens to API calls as needed
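
The documentation update and authentication steps above can be sketched as follows. The helper names are hypothetical and the sketch assumes the token file holds a single token, possibly followed by a newline. The `Authorization: Bearer` header is the form the Hugging Face Hub HTTP API accepts; other services may expect a different scheme.

```python
from pathlib import Path


def read_hf_token(path=".hf_token"):
    """Read a token from a file, dropping surrounding whitespace and newlines."""
    return Path(path).read_text(encoding="utf-8").strip()


def auth_headers(token):
    """Build the HTTP header that carries the token to the API."""
    return {"Authorization": f"Bearer {token}"}


def append_license_section(readme_text, section_text):
    """Append a license section to the end of a README.

    The existing content is kept as-is; only trailing newlines are
    normalized so exactly one blank line separates the new section.
    """
    return readme_text.rstrip("\n") + "\n\n" + section_text.strip("\n") + "\n"
```

Appending to the retrieved text, instead of regenerating the README, is what keeps the existing formatting and content intact.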

•License Types: Be familiar with common licenses (MIT, Apache-2.0, CC-BY, etc.) and custom licenses (DeepSeek License, The Stack Terms of Use)
•Provenance Chains: Some datasets may have multiple layers of derivation; trace back to the original source
•Format Compliance: Strictly adhere to user-specified output formats; do not modify, add, or remove content outside placeholders
•Error Handling: If license information cannot be determined, document the investigation and ask for clarification
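
Format compliance and error handling can be combined in one small step: fill only the placeholders and refuse to produce output when a license is still undetermined. The sketch below assumes a hypothetical `{{name}}` placeholder syntax; the user's actual template may mark placeholders differently, in which case the `PLACEHOLDER` pattern needs to change.

```python
import re

# Assumption: placeholders look like {{data_license}}; adjust to the user's template.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_placeholders(template, values):
    """Replace each placeholder with its value and leave all other text untouched.

    Raises KeyError when any placeholder has no value, so an undetermined
    license leads to a clarification request instead of a partial reply.
    """
    missing = sorted(
        {name for name in PLACEHOLDER.findall(template) if name not in values}
    )
    if missing:
        raise KeyError(f"No value for placeholder(s): {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)
```

For example, `fill_placeholders("Data: {{data_license}}", {"data_license": "MIT"})` returns `"Data: MIT"`, while a template with an unfilled placeholder raises before any text is posted.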