awsflow-glue

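Use awsflow to manage AWS Glue ETL jobs, triggers, and crawlers, and to query the Data Catalog (databases, tables, partitions, connections). Create and start jobs, inspect crawl history, and manage tags.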

SKILL.md
---
name: awsflow-glue
description: Manage AWS Glue ETL jobs, triggers, crawlers, and query the Data Catalog (databases, tables, partitions, connections) using awsflow. Create and start jobs, inspect crawl history, and manage tags.
---

Awsflow Glue

Manage Glue ETL jobs, triggers, crawlers, and query the Data Catalog.

When to Use This Skill

Use this skill when the user:

  • Asks about Glue ETL jobs, crawlers, or triggers
  • Wants to inspect or query the Glue Data Catalog (databases, tables, partitions)
  • Needs to start a Glue job run
  • Wants to create a Glue job
  • Asks about Glue connections or job bookmarks
  • Needs to inspect crawl history

Tool: GlueTool

Execute AWS Glue commands, including Data Catalog queries. Always provide a params object, even when a command needs no parameters (pass {}).
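Since every call needs a params object, a small request builder can default it to an empty one. This is a minimal sketch: `make_request` is a hypothetical helper for illustration, not part of awsflow.

```python
import json
from typing import Any, Dict, Optional


def make_request(command: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a GlueTool request that always carries a params object."""
    # Copy the dict so later edits by the caller do not leak into the request.
    return {"command": command, "params": dict(params or {})}


if __name__ == "__main__":
    print(json.dumps(make_request("ListCrawlers")))
    print(json.dumps(make_request("GetJob", {"JobName": "my-etl-job"})))
```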

Commands

ListJobs

List Glue jobs.

json
{ "command": "ListJobs", "params": { "MaxResults": 50 } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| MaxResults | number | No | Maximum items to return |
| nextToken | string | No | Pagination token |
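Commands that accept `nextToken` can be drained page by page. The sketch below assumes the tool's response returns the next token under `nextToken` and the items under a key such as `JobNames`. Both are assumptions about the response shape, so they are left configurable. `call_tool` stands in for whatever function actually invokes GlueTool, and `fake_call_tool` is a canned stub for demonstration only.

```python
from typing import Any, Callable, Dict, Iterator, Optional

ToolCall = Callable[[Dict[str, Any]], Dict[str, Any]]


def paginate(call_tool: ToolCall, command: str, params: Dict[str, Any],
             items_key: str, token_key: str = "nextToken") -> Iterator[Any]:
    """Yield items from every page of a paginated GlueTool command."""
    token: Optional[str] = None
    while True:
        page_params = dict(params)
        if token:
            page_params["nextToken"] = token  # request parameter listed in the table
        response = call_tool({"command": command, "params": page_params})
        yield from response.get(items_key, [])
        token = response.get(token_key)  # assumed response field name
        if not token:
            return


def fake_call_tool(request: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the real tool: serves two canned pages."""
    pages = {
        None: {"JobNames": ["job-a", "job-b"], "nextToken": "t1"},
        "t1": {"JobNames": ["job-c"]},
    }
    return pages[request["params"].get("nextToken")]


if __name__ == "__main__":
    print(list(paginate(fake_call_tool, "ListJobs", {"MaxResults": 2}, "JobNames")))
```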

GetJob

Get details of a Glue job.

json
{ "command": "GetJob", "params": { "JobName": "my-etl-job" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| JobName | string | Yes | Job name |

GetJobRun

Get details of a specific job run.

json
{ "command": "GetJobRun", "params": { "JobName": "my-etl-job", "RunId": "jr_abc123" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| JobName | string | Yes | Job name |
| RunId | string | Yes | Job run ID |
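GetJobRun is the natural building block for waiting on a run. The sketch below polls until the run reaches a terminal state. It assumes the tool returns the AWS Glue response shape (`JobRun.JobRunState`); `call_tool` and `wait_for_job_run` are hypothetical names, and the polling interval is illustrative.

```python
import time
from typing import Any, Callable, Dict

# Terminal values of JobRunState in the AWS Glue API.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_job_run(call_tool: Callable[[Dict[str, Any]], Dict[str, Any]],
                     job_name: str, run_id: str,
                     poll_seconds: float = 30.0, max_polls: int = 120,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll GetJobRun until the run finishes; return the final state."""
    request = {"command": "GetJobRun", "params": {"JobName": job_name, "RunId": run_id}}
    for _ in range(max_polls):
        response = call_tool(request)
        state = response["JobRun"]["JobRunState"]  # assumed response shape
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name}/{run_id} still running after {max_polls} polls")


def make_fake_tool(states):
    """Stub that reports the given states one call at a time."""
    remaining = iter(states)
    return lambda request: {"JobRun": {"JobRunState": next(remaining)}}


if __name__ == "__main__":
    fake = make_fake_tool(["STARTING", "RUNNING", "SUCCEEDED"])
    print(wait_for_job_run(fake, "my-etl-job", "jr_abc123", sleep=lambda s: None))
```

Injecting `sleep` keeps the loop testable without real delays.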

GetJobRuns

List all runs of a job.

json
{ "command": "GetJobRuns", "params": { "JobName": "my-etl-job", "MaxResults": 20 } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| JobName | string | Yes | Job name |
| MaxResults | number | No | Maximum items to return |
| nextToken | string | No | Pagination token |

GetJobBookmark

Get the bookmark state for a job.

json
{ "command": "GetJobBookmark", "params": { "JobName": "my-etl-job" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| JobName | string | Yes | Job name |

StartJobRun

Start a Glue job run.

json
{ "command": "StartJobRun", "params": { "JobName": "my-etl-job", "Arguments": { "--input-path": "s3://my-bucket/input/" } } }

Parameters:

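| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| JobName | string | Yes | Job name |
| Arguments | object | No | Job run arguments (override the job's default arguments) |
| Timeout | number | No | Job run timeout in minutes |
| MaxCapacity | number | No | Maximum DPU capacity |
| WorkerType | string | No | Standard, G.1X, G.2X, G.025X |
| NumberOfWorkers | number | No | Number of workers |
| SecurityConfiguration | string | No | Security configuration name |
| AllocatedCapacity | number | No | Allocated capacity in DPUs (deprecated in the AWS Glue API; prefer MaxCapacity) |
| JobRunId | string | No | ID of a previous job run to retry |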
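The AWS Glue API documentation advises against setting MaxCapacity together with WorkerType and NumberOfWorkers. A request builder can enforce that before the call is made. This is a sketch under that assumption; `build_start_job_run` is a hypothetical helper, and whether awsflow performs its own validation is not stated here.

```python
from typing import Any, Dict, Optional


def build_start_job_run(job_name: str,
                        arguments: Optional[Dict[str, str]] = None,
                        worker_type: Optional[str] = None,
                        number_of_workers: Optional[int] = None,
                        max_capacity: Optional[float] = None,
                        timeout_minutes: Optional[int] = None) -> Dict[str, Any]:
    """Build a StartJobRun request, rejecting conflicting capacity settings."""
    uses_workers = worker_type is not None or number_of_workers is not None
    if max_capacity is not None and uses_workers:
        raise ValueError("Set either MaxCapacity or WorkerType/NumberOfWorkers, not both")
    optional = {
        "Arguments": arguments,
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
        "MaxCapacity": max_capacity,
        "Timeout": timeout_minutes,
    }
    params: Dict[str, Any] = {"JobName": job_name}
    # Only send the options the caller actually set.
    params.update({key: value for key, value in optional.items() if value is not None})
    return {"command": "StartJobRun", "params": params}


if __name__ == "__main__":
    print(build_start_job_run("my-etl-job",
                              arguments={"--input-path": "s3://my-bucket/input/"},
                              worker_type="G.1X", number_of_workers=5))
```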

CreateJob

Create a new Glue job.

json
{
  "command": "CreateJob",
  "params": {
    "Name": "my-new-job",
    "Role": "arn:aws:iam::123456789012:role/GlueRole",
    "Command": { "Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/etl.py" },
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10
  }
}

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Name | string | Yes | Job name |
| Role | string | Yes | IAM role ARN |
| Command | object | Yes | Command config with Name (glueetl, pythonshell, or gluestreaming) and ScriptLocation |
| Description | string | No | Job description |
| LogUri | string | No | S3 URI for job logs |
| DefaultArguments | object | No | Default job arguments |
| MaxRetries | number | No | Maximum retries |
| Timeout | number | No | Timeout in minutes |
| MaxCapacity | number | No | Maximum DPU capacity |
| WorkerType | string | No | Standard, G.1X, G.2X, G.025X |
| NumberOfWorkers | number | No | Number of workers |
| SecurityConfiguration | string | No | Security configuration name |
| Tags | object | No | Key-value tags |

ListTriggers

List Glue triggers.

json
{ "command": "ListTriggers", "params": { "MaxResults": 50 } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| DependentJobName | string | No | Filter by dependent job name |
| MaxResults | number | No | Maximum items to return |
| nextToken | string | No | Pagination token |

GetTrigger

Get details of a trigger.

json
{ "command": "GetTrigger", "params": { "Name": "my-trigger" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Name | string | Yes | Trigger name |

GetTriggers

List triggers with optional filter.

json
{ "command": "GetTriggers", "params": { "DependencyJobName": "my-job" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| DependencyJobName | string | No | Filter by dependency job name |
| MaxResults | number | No | Maximum items to return |
| nextToken | string | No | Pagination token |

ListCrawlers

List Glue crawlers.

json
{ "command": "ListCrawlers", "params": {} }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| MaxResults | number | No | Maximum items to return |
| nextToken | string | No | Pagination token |

GetCrawler

Get details of a crawler.

json
{ "command": "GetCrawler", "params": { "CrawlerName": "my-crawler" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| CrawlerName | string | Yes | Crawler name |

GetCrawlers

List crawlers with details.

json
{ "command": "GetCrawlers", "params": {} }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| MaxResults | number | No | Maximum items to return |
| nextToken | string | No | Pagination token |

ListCrawls

List crawl runs for a crawler.

json
{ "command": "ListCrawls", "params": { "CrawlerName": "my-crawler" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| CrawlerName | string | Yes | Crawler name |
| MaxResults | number | No | Maximum items to return |
| nextToken | string | No | Pagination token |
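When inspecting crawl history, the failed runs are usually the interesting ones. The sketch below filters them out of a ListCrawls result. It assumes the tool returns the AWS Glue response shape, a `Crawls` list whose entries carry `CrawlId`, `State`, and `ErrorMessage`; `failed_crawls` is a hypothetical helper and the sample data is made up.

```python
from typing import Any, Dict, List


def failed_crawls(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the ID and error message of each failed crawl."""
    return [
        {"CrawlId": crawl.get("CrawlId"), "ErrorMessage": crawl.get("ErrorMessage")}
        for crawl in response.get("Crawls", [])  # assumed response key
        if crawl.get("State") == "FAILED"
    ]


if __name__ == "__main__":
    sample = {
        "Crawls": [
            {"CrawlId": "c-1", "State": "COMPLETED"},
            {"CrawlId": "c-2", "State": "FAILED", "ErrorMessage": "Access denied"},
        ]
    }
    print(failed_crawls(sample))
```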

GetDatabase

Get a Glue Data Catalog database.

json
{ "command": "GetDatabase", "params": { "DatabaseName": "my-database" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| DatabaseName | string | Yes | Database name |
| CatalogId | string | No | Catalog ID (AWS account ID) |

GetDatabases

List all databases in the Data Catalog.

json
{ "command": "GetDatabases", "params": {} }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| CatalogId | string | No | Catalog ID |

GetTable

Get a table definition from the Data Catalog.

json
{ "command": "GetTable", "params": { "DatabaseName": "my-database", "TableName": "my-table" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| DatabaseName | string | Yes | Database name |
| TableName | string | Yes | Table name |
| CatalogId | string | No | Catalog ID |
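A table definition is mostly useful for its schema. The sketch below flattens data columns and partition keys into one list. It assumes the tool returns the AWS Glue response shape (`Table.StorageDescriptor.Columns` and `Table.PartitionKeys`); `table_columns` is a hypothetical helper and the sample table is invented.

```python
from typing import Any, Dict, List, Tuple


def table_columns(response: Dict[str, Any]) -> List[Tuple[str, str, bool]]:
    """Return (name, type, is_partition_key) for every column of a table."""
    table = response.get("Table", {})  # assumed response key
    data_columns = table.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table.get("PartitionKeys", [])
    columns = [(col["Name"], col["Type"], False) for col in data_columns]
    columns += [(col["Name"], col["Type"], True) for col in partition_keys]
    return columns


if __name__ == "__main__":
    sample = {
        "Table": {
            "Name": "my-table",
            "StorageDescriptor": {"Columns": [{"Name": "id", "Type": "bigint"}]},
            "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        }
    }
    for name, col_type, is_partition in table_columns(sample):
        print(name, col_type, "partition" if is_partition else "data")
```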

GetTables

List tables in a database.

json
{ "command": "GetTables", "params": { "DatabaseName": "my-database" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| DatabaseName | string | Yes | Database name |
| CatalogId | string | No | Catalog ID |

GetPartitions

List partitions for a table.

json
{ "command": "GetPartitions", "params": { "DatabaseName": "my-database", "TableName": "my-table" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| DatabaseName | string | Yes | Database name |
| TableName | string | Yes | Table name |
| Expression | string | No | Partition filter expression |
| CatalogId | string | No | Catalog ID |
| Segment | object | No | Segment config for parallel scanning |
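In the AWS Glue API, a Segment has a zero-based `SegmentNumber` and a `TotalSegments` count between 1 and 10, so a large table can be scanned with one request per segment. The sketch below builds those requests, assuming awsflow passes the Segment object through unchanged. `segmented_partition_requests` is a hypothetical helper and the filter expression is illustrative.

```python
from typing import Any, Dict, List, Optional


def segmented_partition_requests(database: str, table: str, total_segments: int,
                                 expression: Optional[str] = None) -> List[Dict[str, Any]]:
    """Build one GetPartitions request per segment for parallel scanning."""
    if not 1 <= total_segments <= 10:
        raise ValueError("TotalSegments must be between 1 and 10")
    requests = []
    for segment_number in range(total_segments):  # segment numbers are zero-based
        params: Dict[str, Any] = {
            "DatabaseName": database,
            "TableName": table,
            "Segment": {"SegmentNumber": segment_number, "TotalSegments": total_segments},
        }
        if expression:
            params["Expression"] = expression
        requests.append({"command": "GetPartitions", "params": params})
    return requests


if __name__ == "__main__":
    for request in segmented_partition_requests("my-database", "my-table", 3,
                                                expression="year = '2024'"):
        print(request["params"]["Segment"])
```

Each request can then be issued concurrently and the results concatenated.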

GetConnections

List or get Glue connections.

json
{ "command": "GetConnections", "params": { "ConnectionName": "my-connection" } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| ConnectionName | string | No | Connection name |
| HidePassword | boolean | No | Hide connection password |
| CatalogId | string | No | Catalog ID |

GetTags

Get tags for a Glue resource.

json
{ "command": "GetTags", "params": { "ResourceArn": "arn:aws:glue:..." } }

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| ResourceArn | string | Yes | Resource ARN |
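Glue ARNs for jobs, crawlers, and triggers follow the pattern `arn:<partition>:glue:<region>:<account-id>:<type>/<name>`. The sketch below assembles one for use as ResourceArn. `glue_resource_arn` is a hypothetical helper, and the region and account ID in the usage line are placeholders.

```python
from typing import Any, Dict

SUPPORTED_TYPES = {"job", "crawler", "trigger"}


def glue_resource_arn(resource_type: str, name: str, region: str,
                      account_id: str, partition: str = "aws") -> str:
    """Assemble the ARN of a Glue job, crawler, or trigger."""
    if resource_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported resource type: {resource_type}")
    return f"arn:{partition}:glue:{region}:{account_id}:{resource_type}/{name}"


def get_tags_request(resource_arn: str) -> Dict[str, Any]:
    """Wrap an ARN in a GetTags request."""
    return {"command": "GetTags", "params": {"ResourceArn": resource_arn}}


if __name__ == "__main__":
    arn = glue_resource_arn("job", "my-etl-job", "us-east-1", "123456789012")
    print(get_tags_request(arn))
```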

Related Services

  • Glue → CloudWatch Logs: Glue job output logs go to /aws-glue/jobs/output and error logs to /aws-glue/jobs/error. Crawler logs go to /aws-glue/crawlers. Use CloudWatchLogTool to read them.
  • Glue → S3: Glue jobs read from and write to S3. Crawler targets are often S3 paths, and job scripts are stored in S3. Use S3Tool to inspect them.
  • Glue → IAM: Jobs require IAM roles. Use IAMTool to inspect the execution role.
  • Glue → Data Catalog → Athena/EMR/Redshift: The Glue Data Catalog is shared with Athena, EMR, and Redshift Spectrum.
  • Glue → CloudFormation: Glue resources can be managed by CloudFormation stacks.
  • Glue → DynamoDB: Glue can read from DynamoDB tables as data sources.