parse-extract-input

For text processing: use regular expression patterns and parsing utilities to extract numbers, words, and structured data from messy text.

SKILL.md
--- frontmatter
name: parse-extract-input
description: "For text processing: extract numbers, words, structured data from messy text using regex patterns, parsing utilities."

parse-extract-input

When to Use

  • Extracting numbers from text
  • Finding words or identifiers
  • Parsing structured input files
  • Data cleaning
  • Advent of Code input parsing

When NOT to Use

  • The input is already in a standard format such as JSON or CSV (use the dedicated parser)
  • The input follows a complex or nested grammar (use a proper parser)
  • A simple str.split() is enough
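The cases above can be sketched with the standard library alone. The sample strings below are invented for illustration; the point is only that `csv`, `json`, and `str.split` already handle these inputs, so no regex is needed.

```python
import csv
import io
import json

# Hypothetical sample inputs, invented for this sketch.
csv_text = 'name,score\n"Smith, Ann",42\nBob,17\n'
json_text = '{"robot": {"x": 3, "y": -5}}'

# csv copes with quoted fields that contain commas,
# which a naive regex or split would cut in two.
rows = list(csv.reader(io.StringIO(csv_text)))

# json returns real Python types directly.
robot = json.loads(json_text)["robot"]

# When fields are cleanly delimited, a plain split is enough.
fields = tuple(int(x) for x in "10,20,30".split(","))

print(rows[1])   # ['Smith, Ann', '42']
print(robot)     # {'x': 3, 'y': -5}
print(fields)    # (10, 20, 30)
```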

The Pattern

Use regex or helper functions to extract structured data from text.

python
import re

def ints(text):
    """Extract all integers from text."""
    return tuple(map(int, re.findall(r'-?[0-9]+', text)))

def words(text):
    """Extract all words from text."""
    return tuple(re.findall(r'[a-zA-Z]+', text))

def atoms(text):
    """Extract all atoms (numbers or identifiers)."""
    return tuple(atom(s) for s in re.findall(r'[+-]?\d+\.?\d*|\w+', text))

def atom(s):
    """Parse string as number or keep as string."""
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return s
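A minimal sketch of how these helpers might be applied line by line, assuming a made-up puzzle input in which each line holds a position and a velocity. `ints` is repeated so the block runs on its own; `parse_lines` is a hypothetical helper introduced here, not part of the utilities above.

```python
import re

def ints(text):
    """Extract all integers from text."""
    return tuple(map(int, re.findall(r'-?[0-9]+', text)))

def parse_lines(text, parser=ints):
    """Apply parser to every non-empty line (hypothetical helper)."""
    return tuple(parser(line) for line in text.splitlines() if line.strip())

# Invented sample input: one record per line, surrounded by punctuation noise.
sample = """\
position=<9, 1> velocity=<0, 2>
position=<-3, 11> velocity=<1, -2>
"""

records = parse_lines(sample)
print(records)  # ((9, 1, 0, 2), (-3, 11, 1, -2))
```

Passing the parser as an argument keeps the line loop in one place, so the same call could use `words` or `atoms` instead of `ints`.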

Example (adapted from pytudes AdventUtils.ipynb)

python
import re
from typing import Tuple

def ints(text: str) -> Tuple[int, ...]:
    """A tuple of all the integers in text."""
    return tuple(map(int, re.findall(r'-?[0-9]+', text)))

def positive_ints(text: str) -> Tuple[int, ...]:
    """A tuple of all integers in text, ignoring any minus signs (so '-5' yields 5)."""
    return tuple(map(int, re.findall(r'[0-9]+', text)))

def digits(text: str) -> Tuple[int, ...]:
    """A tuple of all single digits in text."""
    return tuple(map(int, re.findall(r'[0-9]', text)))

def words(text: str) -> Tuple[str, ...]:
    """A tuple of all alphabetic words in text."""
    return tuple(re.findall(r'[a-zA-Z]+', text))

def atoms(text: str) -> Tuple:
    """A tuple of all atoms (numbers or identifiers)."""
    return tuple(map(atom, re.findall(r'[+-]?\d+\.?\d*|\w+', text)))

def atom(text: str):
    """Parse text into a single float or int or str."""
    try:
        x = float(text)
        return round(x) if x.is_integer() else x
    except ValueError:
        return text.strip()

# Usage examples
ints("Robot at (3, -5) with speed 10")  # (3, -5, 10)
words("Hello, World! 123")  # ('Hello', 'World')
atoms("x=42, y=3.14, name=foo")  # ('x', 42, 'y', 3.14, 'name', 'foo')
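Two edge cases follow from the regexes above and are worth checking against real input. In a range such as `1-3`, the pattern `-?[0-9]+` reads the hyphen as a minus sign, and because `atom` relies on `float()`, the word `nan` is parsed as a float rather than kept as a string. The sketch below repeats the helpers so it runs on its own; the sample strings are invented.

```python
import math
import re

def ints(text):
    """All integers in text; a leading '-' is treated as a minus sign."""
    return tuple(map(int, re.findall(r'-?[0-9]+', text)))

def positive_ints(text):
    """All digit runs in text, with any '-' ignored."""
    return tuple(map(int, re.findall(r'[0-9]+', text)))

def atom(text):
    """Parse text into a single float or int or str."""
    try:
        x = float(text)
        return round(x) if x.is_integer() else x
    except ValueError:
        return text.strip()

ranges = "1-3,5-7"
signed = ints(ranges)             # hyphens are read as minus signs
unsigned = positive_ints(ranges)  # hyphens are ignored
print(signed)    # (1, -3, 5, -7)
print(unsigned)  # (1, 3, 5, 7)

not_a_word = atom("nan")      # float('nan') succeeds, so this is a float
still_a_word = atom("info")   # float('info') fails, so the string is kept
print(math.isnan(not_a_word))  # True
print(still_a_word)            # info
```

If the input contains ranges or hyphenated identifiers, `positive_ints` is probably the safer choice; if it may contain words like `nan` or `inf`, a stricter numeric check before calling `float()` may be needed.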

Key Principles

  1. Regex for patterns: r'-?[0-9]+' for signed integers
  2. Return tuples: Immutable, hashable results
  3. Cascading conversion: Try int, then float, and fall back to the original string if neither parses
  4. Negative-aware: Include -? for negative numbers
  5. Reusable utilities: Same functions for any input
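Principle 2 can be illustrated with a small sketch: because the helpers return tuples, parsed records can go straight into a set or serve as dictionary keys, which a list could not do. The coordinate lines below are invented sample data, and `ints` is repeated so the block is self-contained.

```python
import re
from collections import Counter

def ints(text):
    """Extract all integers from text."""
    return tuple(map(int, re.findall(r'-?[0-9]+', text)))

# Invented sample data: one "x,y" coordinate per line, with a repeat.
lines = ["3,4", "1,2", "3,4"]

# Tuples are hashable, so duplicates collapse in a set...
unique_points = {ints(line) for line in lines}

# ...and each point can be used as a dictionary (Counter) key.
visits = Counter(ints(line) for line in lines)

print(sorted(unique_points))  # [(1, 2), (3, 4)]
print(visits[(3, 4)])         # 2

# A list built from the same data is not hashable.
try:
    {[3, 4]}
except TypeError:
    print("lists cannot be set members")
```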