SQL Query Generation
This skill enables an AI agent to translate natural language questions into correct, efficient SQL queries. The agent maps user intent to the appropriate query constructs — joins, aggregations, window functions, CTEs, and subqueries — while respecting the target database schema. It also analyzes query performance with EXPLAIN plans and recommends optimizations such as indexing, predicate pushdown, and query restructuring.
Workflow
- •
Parse the natural language request. Extract the analytical intent: what metric is being asked for, which entities are involved, what filters apply, and how results should be ordered or grouped. Distinguish between requests for aggregated summaries versus row-level detail.
- •
Map to the database schema. Identify the relevant tables and columns from the schema. Resolve ambiguous references (e.g., "sales" could mean the
orderstable or therevenuecolumn). Determine the join path between tables using foreign key relationships, avoiding unnecessary joins that inflate result sets. - •
Select the appropriate query constructs. Choose between simple aggregation, window functions, CTEs, or subqueries based on complexity. Use CTEs for multi-step calculations to improve readability. Use window functions for running totals, rankings, and comparisons within partitions. Prefer explicit JOINs over implicit comma-separated joins.
- •
Generate the SQL query. Write syntactically correct SQL with consistent formatting: uppercase keywords, lowercase identifiers, aliased tables, and indented clauses. Include comments for complex logic. Always specify column aliases for computed expressions.
- •
Validate and optimize. Run EXPLAIN (or EXPLAIN ANALYZE) on the generated query to inspect the execution plan. Look for full table scans, hash joins on large tables, and sort operations on unindexed columns. Recommend indexes or query rewrites when the estimated cost is high.
- •
Return results with explanation. Present the query alongside a plain-language explanation of what it does, the expected output format, and any assumptions made about the schema or data.
Supported Technologies
- •PostgreSQL — CTEs, window functions, LATERAL joins, EXPLAIN ANALYZE
- •MySQL — common table expressions (8.0+), window functions (8.0+), EXPLAIN FORMAT=JSON
- •SQLite — lightweight queries, window functions (3.25+)
- •SQL Server — T-SQL extensions, cross apply, query store analysis
- •BigQuery — STRUCT/ARRAY types, QUALIFY clause, slot-based cost estimation
Usage
Provide the agent with a natural language question and the database schema (as DDL, a schema diagram, or a list of tables and columns). Optionally specify the target SQL dialect, any performance constraints, and whether you want the EXPLAIN analysis included.
Examples
Example 1: Complex analytics query from natural language
Request: "Show the top 5 customers by total spending in the last 12 months, along with their most frequently ordered product category and how their spending compares to the previous 12-month period."
WITH current_period AS (
SELECT
c.customer_id,
c.customer_name,
SUM(oi.quantity * oi.unit_price) AS total_spent,
COUNT(DISTINCT o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY c.customer_id, c.customer_name
),
previous_period AS (
SELECT
c.customer_id,
SUM(oi.quantity * oi.unit_price) AS prev_total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '24 months'
AND o.order_date < CURRENT_DATE - INTERVAL '12 months'
GROUP BY c.customer_id
),
top_categories AS (
SELECT DISTINCT ON (c.customer_id)
c.customer_id,
pc.category_name AS favorite_category,
COUNT(*) AS category_orders
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN product_categories pc ON p.category_id = pc.category_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY c.customer_id, pc.category_name
ORDER BY c.customer_id, COUNT(*) DESC
)
SELECT
cp.customer_name,
cp.total_spent,
cp.order_count,
tc.favorite_category,
pp.prev_total_spent,
ROUND(
(cp.total_spent - COALESCE(pp.prev_total_spent, 0))
/ NULLIF(pp.prev_total_spent, 0) * 100, 1
) AS spending_change_pct
FROM current_period cp
LEFT JOIN previous_period pp ON cp.customer_id = pp.customer_id
LEFT JOIN top_categories tc ON cp.customer_id = tc.customer_id
ORDER BY cp.total_spent DESC
LIMIT 5;
-- Expected output:
-- customer_name | total_spent | order_count | favorite_category | prev_total_spent | spending_change_pct
-- Acme Corp | 284,500.00 | 47 | Electronics | 198,200.00 | 43.5
-- GlobalTech | 231,800.00 | 38 | Software | 245,100.00 | -5.4
-- ...
Example 2: Optimizing a slow query with EXPLAIN analysis
Original slow query (takes 12.4 seconds on 5M rows):
SELECT product_name, SUM(quantity * unit_price) AS revenue FROM order_items oi, products p, orders o WHERE oi.product_id = p.product_id AND oi.order_id = o.order_id AND o.order_date BETWEEN '2024-01-01' AND '2024-12-31' GROUP BY product_name ORDER BY revenue DESC;
EXPLAIN ANALYZE output (problem indicators):
Seq Scan on orders o (cost=0.00..98456.00 rows=1245000) Filter: (order_date >= '2024-01-01' AND order_date <= '2024-12-31') Rows Removed by Filter: 3755000 Hash Join (cost=98456.00..245678.00 rows=3200000) Sort (cost=312456.00..312460.00 rows=8500) Sort Method: external merge Disk: 4096kB
Issues identified:
- •Sequential scan on
orders— no index onorder_date - •Implicit join syntax hides join order from optimizer
- •Sort spilling to disk due to insufficient
work_mem
Optimized query:
-- Step 1: Create index (one-time)
CREATE INDEX idx_orders_date ON orders (order_date)
INCLUDE (order_id);
-- Step 2: Rewrite with explicit joins and date index hint
SELECT
p.product_name,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY p.product_name
ORDER BY revenue DESC;
-- After optimization: 0.34 seconds (36x faster)
-- EXPLAIN now shows:
-- Index Scan on idx_orders_date (rows=1245000, actual=1243892)
-- Merge Join (cost reduced by 85%)
-- Sort Method: quicksort Memory: 512kB
Best Practices
- •Always use explicit JOIN syntax instead of comma-separated implicit joins — it makes intent clear and prevents accidental cross joins.
- •Alias every table and every computed column for readability and to avoid ambiguity in complex queries.
- •Use CTEs to break complex queries into named, logical steps rather than deeply nesting subqueries.
- •Filter early: place WHERE conditions on the driving table to reduce the dataset before joins amplify row counts.
- •Prefer
COUNT(DISTINCT col)overCOUNT(*)on joined tables to avoid inflated counts from one-to-many relationships. - •Always test generated queries against the actual schema before presenting them as final — column names and types in natural language descriptions often differ from the real DDL.
Edge Cases
- •Ambiguous column names. When multiple tables have a column with the same name (e.g.,
id,name,status), always qualify with the table alias. Prompt the user for clarification if the natural language request is genuinely ambiguous. - •NULL handling in aggregations.
SUM,AVG, andCOUNT(col)silently ignore NULLs. When NULLs are meaningful (e.g., "no sale"), useCOALESCE(col, 0)before aggregating and note the assumption. - •Division by zero in calculated metrics. Wrap denominators with
NULLIF(denominator, 0)to return NULL instead of an error, then handle the NULL in the presentation layer. - •Date/time zone mismatches. When filtering by date on a timestamp column, be explicit about boundaries: use
>= '2024-01-01' AND < '2025-01-01'instead ofBETWEEN, which includes the end boundary's midnight. - •Very large result sets. Always include
LIMITin exploratory queries. For production queries, add pagination withOFFSET/FETCHor keyset pagination for better performance on deep pages.