Lexer Generator

Generates lexical analyzers (tokenizers) from regex specifications.

When to Use

•Building compilers or interpreters
•Creating lexers for DSLs
•Parsing structured text formats
•Implementing programming languages

What This Skill Does

•Parses regex specs - Converts regex to NFA/DFA
•Generates lexers - Produces executable tokenizer code
•Handles conflicts - Resolves longest-match ambiguities
•Optimizes - Minimizes DFA states

Key Concepts

Concept	Description
Token	Lexical unit (keyword, identifier, number)
Lexeme	Actual source text matched
Position	Location info for error reporting
Longest match	Prefer longest matching lexeme
Priority	First matching rule wins (tie-break)

Tips

•Handle unicode properly (use Unicode flag)
•Track line/column for error messages
•Skip comments and whitespace
•Handle reserved words (keyword vs identifier)
•Consider lookahead for better error recovery

Common Issues

Issue	Solution
Ambiguous patterns	Use longest match + priority
Performance	Use DFA, not backtracking
Unicode	Use `re.UNICODE` flag
String literals	Use greedy matching carefully

Related Skills

•parser-generator - Generate parsers
•program-transformer - Build ASTs
•lambda-calculus-interpreter - Execute programs

Canonical References

Reference	Why It Matters
Aho, Lam, Sethi, Ullman, "Compilers: Principles, Techniques, and Tools"	Dragon Book; chapters 3-4 on lexical analysis
Thompson, "Programming Techniques: Regular Expression Search Algorithm"	Original NFA construction
McNaughton & Yamada, "Regular Expressions and State Graphs"	DFA construction
Fischer & LeBlanc, "Crafting a Compiler"	Practical lexer design
Russ Cox, "Regular Expression Matching Can Be Simple and Fast"	Modern regex implementation

Tradeoffs and Limitations

Lexer Algorithm Tradeoffs

Approach	Pros	Cons
Regex-based	Easy to write	May backtrack
NFA-based	Simple construction	Large states
DFA-based	Fast matching	Large table

When NOT to Build a Custom Lexer

•For simple formats: Use existing parsers (JSON, CSV)
•For prototyping: String splitting may suffice
•For existing tools: Lex/Flex generate efficient lexers

Complexity Considerations

•DFA construction: O(n²) worst case in states
•Matching: O(1) per character with DFA
•NFA simulation: O(n) per character

Limitations

•Ambiguity: Keywords vs identifiers, longest match rule
•Context sensitivity: Can't handle indentation (use parser)
•Unicode: Complex character classes
•Error recovery: Hard to provide good error messages
•Performance: Backtracking regex can be exponential

Research Tools & Artifacts

Lexer generators and tools:

Tool	Language	What to Learn
Flex	C	Fast lexer generation
ANTLR lexer	Java	Integrated lexing/parsing
PLY	Python	Lex in Python
rustlex	Rust	Safe lexer generation
JFlex	Java	Java lexer generation

Regex Implementations

•RE2 (Google) - DFA-based, linear time
•Hyperscan (Intel) - Pattern matching
•PCRE2 - Backtracking with optimizations

Research Frontiers

1. Unicode Lexing

•Challenge: Character classes across scripts
•Approach: Unicode property classes, grapheme clusters
•Papers: "Unicode Support in Parsing" (various)

2. Incremental Lexing

•Challenge: Edit-based lexing for IDEs
•Approach: Cache tokens, re-lex only affected regions
•Tools: Tree-sitter (incremental)

3. Scannerless Parsing

•Challenge: Lexer/parser coupling
•Approach: Combine into single phase
•Tools: Scannerless GLR parsers

Implementation Pitfalls

Pitfall	Real Consequence	Solution
Longest match	"=" vs "==" ambiguity	Use longest match
Keyword vs identifier	"if" recognized as identifier	Reserve keywords
Greedy matching	"//" parsed as "/" + "/"	Careful regex
Unicode	Wrong character classes	Use Unicode properties

The "Keyword vs Identifier" Problem

Simple lexer issue:

code

if  -> KEYWORD
iff -> IDENTIFIER

Solution: Check keywords after matching identifier:

python

def tokenize(self, source):
    match = self.longest_match(source)
    if match.type == 'IDENT' and match.text in self.keywords:
        match.type = 'KEYWORD'
    return match

This is why lexers need reservation!