AgentSkillsCN

lexer-generator

根据正则表达式规范生成词法分析器。适用场景:(1) 构建编译器;(2) 实现解释器;(3) 创建面向特定领域的语言。

SKILL.md
--- frontmatter
name: lexer-generator
description: 'Generates lexical analyzers from regex specifications. Use when: (1)
  Building compilers, (2) Implementing interpreters, (3) Creating domain-specific
  languages.'
version: 1.0.0
tags:
- compiler
- lexing
- parsing
- frontend
difficulty: beginner
languages:
- python
- ocaml
- rust
dependencies: []

Lexer Generator

Generates lexical analyzers (tokenizers) from regex specifications.

When to Use

  • Building compilers or interpreters
  • Creating lexers for DSLs
  • Parsing structured text formats
  • Implementing programming languages

What This Skill Does

  1. Parses regex specs - Converts regex to NFA/DFA
  2. Generates lexers - Produces executable tokenizer code
  3. Handles conflicts - Resolves longest-match ambiguities
  4. Optimizes - Minimizes DFA states

Key Concepts

ConceptDescription
TokenLexical unit (keyword, identifier, number)
LexemeActual source text matched
PositionLocation info for error reporting
Longest matchPrefer longest matching lexeme
PriorityFirst matching rule wins (tie-break)

Tips

  • Handle unicode properly (use Unicode flag)
  • Track line/column for error messages
  • Skip comments and whitespace
  • Handle reserved words (keyword vs identifier)
  • Consider lookahead for better error recovery

Common Issues

IssueSolution
Ambiguous patternsUse longest match + priority
PerformanceUse DFA, not backtracking
UnicodeUse re.UNICODE flag
String literalsUse greedy matching carefully

Related Skills

  • parser-generator - Generate parsers
  • program-transformer - Build ASTs
  • lambda-calculus-interpreter - Execute programs

Canonical References

ReferenceWhy It Matters
Aho, Lam, Sethi, Ullman, "Compilers: Principles, Techniques, and Tools"Dragon Book; chapters 3-4 on lexical analysis
Thompson, "Programming Techniques: Regular Expression Search Algorithm"Original NFA construction
McNaughton & Yamada, "Regular Expressions and State Graphs"DFA construction
Fischer & LeBlanc, "Crafting a Compiler"Practical lexer design
Russ Cox, "Regular Expression Matching Can Be Simple and Fast"Modern regex implementation

Tradeoffs and Limitations

Lexer Algorithm Tradeoffs

ApproachProsCons
Regex-basedEasy to writeMay backtrack
NFA-basedSimple constructionLarge states
DFA-basedFast matchingLarge table

When NOT to Build a Custom Lexer

  • For simple formats: Use existing parsers (JSON, CSV)
  • For prototyping: String splitting may suffice
  • For existing tools: Lex/Flex generate efficient lexers

Complexity Considerations

  • DFA construction: O(n²) worst case in states
  • Matching: O(1) per character with DFA
  • NFA simulation: O(n) per character

Limitations

  • Ambiguity: Keywords vs identifiers, longest match rule
  • Context sensitivity: Can't handle indentation (use parser)
  • Unicode: Complex character classes
  • Error recovery: Hard to provide good error messages
  • Performance: Backtracking regex can be exponential

Research Tools & Artifacts

Lexer generators and tools:

ToolLanguageWhat to Learn
FlexCFast lexer generation
ANTLR lexerJavaIntegrated lexing/parsing
PLYPythonLex in Python
rustlexRustSafe lexer generation
JFlexJavaJava lexer generation

Regex Implementations

  • RE2 (Google) - DFA-based, linear time
  • Hyperscan (Intel) - Pattern matching
  • PCRE2 - Backtracking with optimizations

Research Frontiers

1. Unicode Lexing

  • Challenge: Character classes across scripts
  • Approach: Unicode property classes, grapheme clusters
  • Papers: "Unicode Support in Parsing" (various)

2. Incremental Lexing

  • Challenge: Edit-based lexing for IDEs
  • Approach: Cache tokens, re-lex only affected regions
  • Tools: Tree-sitter (incremental)

3. Scannerless Parsing

  • Challenge: Lexer/parser coupling
  • Approach: Combine into single phase
  • Tools: Scannerless GLR parsers

Implementation Pitfalls

PitfallReal ConsequenceSolution
Longest match"=" vs "==" ambiguityUse longest match
Keyword vs identifier"if" recognized as identifierReserve keywords
Greedy matching"//" parsed as "/" + "/"Careful regex
UnicodeWrong character classesUse Unicode properties

The "Keyword vs Identifier" Problem

Simple lexer issue:

code
if  -> KEYWORD
iff -> IDENTIFIER

Solution: Check keywords after matching identifier:

python
def tokenize(self, source):
    match = self.longest_match(source)
    if match.type == 'IDENT' and match.text in self.keywords:
        match.type = 'KEYWORD'
    return match

This is why lexers need reservation!