Data Extraction
Extract specific information from unstructured/semi-structured data with completeness and accuracy.
Common Patterns
| Type | Pattern | Validation |
|---|---|---|
| Email | user@domain.ext | Has @ and . after @ |
| URL | http(s)://domain... | Valid protocol and domain |
| Date | ISO, US, EU, timestamp | Valid ranges (month 1-12) |
| Phone | Various formats | 7-15 digits |
| IP | IPv4: x.x.x.x, IPv6 | Octets 0-255 |
| Key-Value | key=value, key: value | Handle quoted/nested |
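As a rough illustration of the "match, then validate" idea in the table, here is a minimal Python sketch for two of the rows (Email and IPv4). The regular expressions and the helper names `extract_emails`, `extract_ipv4`, and `valid_ipv4` are assumptions made for this example; they are deliberately loose and do not cover every valid address.

```python
import re

# Loose candidate patterns (assumed for illustration): match first, validate afterwards.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def valid_ipv4(candidate: str) -> bool:
    """Apply the table's rule: four octets, each in the range 0-255."""
    parts = candidate.split(".")
    return (
        len(parts) == 4
        and all(p.isdigit() for p in parts)
        and all(0 <= int(p) <= 255 for p in parts)
    )


def extract_emails(text: str) -> list:
    """Return every email-like match, in order of appearance."""
    return EMAIL_RE.findall(text)


def extract_ipv4(text: str) -> list:
    """Return IPv4-like matches that pass the octet range check."""
    return [m for m in IPV4_RE.findall(text) if valid_ipv4(m)]


sample = "Contact ops@example.com from 10.0.0.1 (not 999.1.1.1)."
print(extract_emails(sample))
print(extract_ipv4(sample))
```

Keeping the pattern permissive and the validation separate makes it easier to report rejected candidates (such as `999.1.1.1`) instead of silently missing them.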
Process
- Analyze: Format, delimiters, variations, headers to skip
- Extract: Match all instances, capture context, handle partial matches
- Clean: Trim, normalize (dates to ISO, phones to digits), validate
- Format: Consistent fields, proper escaping, sort/dedupe if needed
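The Clean step above could be sketched in Python roughly as follows. The accepted date formats in `DATE_FORMATS` and the function names are assumptions for this example; in particular, the US format is tried before the EU one, so an ambiguous value like `03/04/2024` resolves as March 4.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed input formats: ISO, US (month first), EU (day first, dotted).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Wrong format or out-of-range value; try the next one.
    return None


def normalize_phone(raw: str) -> Optional[str]:
    """Strip everything except digits; accept only 7-15 digits."""
    digits = re.sub(r"[^0-9]", "", raw)
    return digits if 7 <= len(digits) <= 15 else None


def dedupe(values: list) -> list:
    """Remove duplicates while preserving first-seen order."""
    return list(dict.fromkeys(values))


print(normalize_date(" 03/04/2024 "))
print(normalize_phone("+1 (555) 010-9999"))
print(dedupe(["a", "b", "a"]))
```

Returning `None` for values that fail validation leaves the decision about flagging or dropping them to the caller.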
Output Formats
- JSON: {"results": [...], "summary": {"total": N, "unique": N}}
- CSV: Headers + rows
- Markdown: Table with headers
- Plain: Bullet list
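To make the JSON and CSV shapes concrete, here is a small sketch using only the Python standard library. The function names `to_json` and `to_csv` and the `type`/`value` field names are hypothetical; the JSON layout follows the `results`/`summary` structure listed above.

```python
import csv
import io
import json


def to_json(results: list) -> str:
    """Serialize results with a summary of total and unique counts."""
    summary = {"total": len(results), "unique": len(set(results))}
    return json.dumps({"results": results, "summary": summary}, indent=2)


def to_csv(rows: list, fields: list) -> str:
    """Write a header row plus one row per record; csv handles the escaping."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_json(["a@example.com", "b@example.com", "a@example.com"]))
print(to_csv([{"type": "email", "value": "a@example.com"}], ["type", "value"]))
```

Delegating quoting to the `csv` module avoids hand-rolled escaping bugs when a value contains a comma or a quote.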
Principles
- Complete: Extract ALL matches, don't stop early
- Accurate: Preserve exact values, maintain case
- Handle edge cases: Missing → null, malformed → flag, duplicates → note
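One possible way to apply the edge-case rules above, sketched in Python for email values. The `classify` function, its `(line_number, raw_value)` input shape, and the `status` labels are assumptions for illustration. Duplicates are kept in the results and only noted, and comparison is case-sensitive so original values are preserved.

```python
import re

# Simple full-string check (assumed): something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify(records):
    """records: iterable of (line_number, raw_value_or_None) pairs."""
    results, notes, seen = [], [], set()
    for line_no, raw in records:
        # Missing -> null
        if raw is None or not raw.strip():
            results.append({"line": line_no, "value": None, "status": "missing"})
            continue
        value = raw.strip()
        # Malformed -> flag, and keep the exact value for review
        if not EMAIL_RE.match(value):
            results.append({"line": line_no, "value": value, "status": "malformed"})
            notes.append(f'Line {line_no}: Partial match "{value}"')
            continue
        # Duplicates -> note, but do not drop
        if value in seen:
            notes.append(f'Line {line_no}: Duplicate of "{value}"')
        seen.add(value)
        results.append({"line": line_no, "value": value, "status": "ok"})
    return results, notes


records = [(1, "a@example.com"), (2, None), (3, "user@"), (4, "a@example.com")]
results, notes = classify(records)
print([r["status"] for r in results])
print(notes)
```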
Output Structure
    [Extracted data]

    ## Summary
    - Total: X
    - Unique: Y
    - Issues: Z

    ## Notes
    - Line 42: Partial match "user@" (missing domain)
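A report in this shape could be assembled with a short helper like the one below. `render_report` is a hypothetical name, and the choice to omit the Notes section when there are no issues is an assumption rather than something the template specifies.

```python
def render_report(extracted: list, notes: list) -> str:
    """Build the report: extracted data, then Summary, then Notes (if any)."""
    lines = list(extracted)
    lines += [
        "",
        "## Summary",
        f"- Total: {len(extracted)}",
        f"- Unique: {len(set(extracted))}",
        f"- Issues: {len(notes)}",
    ]
    if notes:
        lines += ["", "## Notes"] + [f"- {note}" for note in notes]
    return "\n".join(lines)


print(render_report(
    ["a@example.com", "a@example.com"],
    ['Line 42: Partial match "user@" (missing domain)'],
))
```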