Readable Hash Model Creation
Quick start
- Confirm requirements
  - Identify the corpus file(s) in `models/corpus/`.
  - Identify the target n-gram size for this task (any positive integer).
  - Identify the desired Rust module and function names and the output behavior (word list vs. sentence, separators, max length).
- Prepare corpus
  - Ensure the corpus is plain-text English.
  - If the input is from Project Gutenberg, use `models/bin/prepare_gutenberg_corpus.sh` to strip headers/footers.
- Train model
  - Run `python3 models/tokenize.py <inputs> -o models/training-data/`.
  - If the task requires a different n-gram size, update `models/tokenize.py` to accept a CLI option (e.g. `--max-ngram`) and use a descriptive variable name. Avoid single-letter names except `i`, `j`, `k`.
  - Update `models/README.md` if new flags or behaviors are added.
- Generate Rust
  - Run `python3 models/generate_rust.py models/training-data/<name>-model.json`.
  - Write output to `src/<model_name>.rs` alongside `src/english_word.rs`.
  - Ensure the generated Rust file embeds the model data and exposes a function that generates the readable hash output for that model.
- Wire public API
  - Expose a new function in `src/lib.rs` that mirrors the style of `english_word_hash`.
  - Keep API naming consistent with the requested model.
- Tests
  - Add Cucumber feature files under `tests/features/`.
  - Update the step definitions in `tests/cucumber.rs` if the new API needs coverage.
- Verify
  - Run `cargo fmt --all` and `cargo test`.
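
For the "Train model" step, the CLI change could look roughly like the sketch below. It is not the contents of `models/tokenize.py`: the function names `build_parser` and `count_ngrams`, the `--output-dir` long form, and the default of 3 are assumptions made for illustration. Only the positional inputs, `-o`, and the example flag `--max-ngram` come from the steps above.

```python
import argparse
from collections import Counter


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a configurable n-gram size (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Train an n-gram model from text files.")
    parser.add_argument("inputs", nargs="+", help="plain-text corpus files")
    parser.add_argument("-o", "--output-dir", default="models/training-data/")
    # Assumed default; pick whatever the existing script already uses.
    parser.add_argument("--max-ngram", type=int, default=3, help="largest n-gram size to count")
    return parser


def count_ngrams(tokens: list[str], max_ngram_size: int) -> dict[int, Counter]:
    """Count every n-gram of size 1..max_ngram_size in a token sequence."""
    if max_ngram_size < 1:
        raise ValueError("max_ngram_size must be at least 1")
    counts_by_size: dict[int, Counter] = {}
    for ngram_size in range(1, max_ngram_size + 1):
        # Number of sliding windows of this size; zero if the input is too short.
        window_count = max(len(tokens) - ngram_size + 1, 0)
        counts_by_size[ngram_size] = Counter(
            tuple(tokens[start:start + ngram_size]) for start in range(window_count)
        )
    return counts_by_size
```

With this shape, `build_parser().parse_args()` exposes the value as `args.max_ngram`, which can be passed straight to the counting function under a descriptive name instead of a single-letter one.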
References to read when needed
- `models/README.md` for model format and generation flow.
- `models/tokenize.py` for tokenization and model training behavior.
- `models/generate_rust.py` for Rust code generation details.
- `src/english_word.rs` and `src/lib.rs` for API patterns.
- `tests/features/*.feature` and `tests/cucumber.rs` for testing patterns.
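
To make the "embed the model data and expose a function" requirement concrete, here is a minimal sketch of a generator that renders a Rust module as text. It does not reflect what `models/generate_rust.py` actually does: the flat word list stands in for the real model format (see `models/README.md`), and `render_rust_module`, the `<NAME>_WORDS` constant, and the byte-to-word mapping are all hypothetical.

```python
def render_rust_module(model_name: str, words: list[str]) -> str:
    """Render Rust source that embeds `words` and exposes `<model_name>_hash`."""
    if not (model_name.isascii() and model_name.isidentifier()):
        raise ValueError(f"not a usable identifier: {model_name!r}")
    if not words:
        raise ValueError("word list must not be empty")
    for word in words:
        # Restrict to ASCII letters so no escaping is needed in Rust literals.
        if not (word.isascii() and word.isalpha()):
            raise ValueError(f"unsupported word: {word!r}")

    const_name = f"{model_name.upper()}_WORDS"
    word_literals = ", ".join(f'"{word}"' for word in words)
    lines = [
        "// Generated file; do not edit by hand.",
        f"pub const {const_name}: [&str; {len(words)}] = [{word_literals}];",
        "",
        # Hypothetical mapping: one word per input byte, joined with hyphens.
        f"pub fn {model_name}_hash(bytes: &[u8]) -> String {{",
        "    bytes",
        "        .iter()",
        f"        .map(|byte| {const_name}[*byte as usize % {const_name}.len()])",
        "        .collect::<Vec<_>>()",
        '        .join("-")',
        "}",
        "",
    ]
    return "\n".join(lines)
```

Calling `render_rust_module("example", ["alpha", "bravo"])` returns a string that could be written to `src/example.rs`; the real generator would replace the word list and the mapping with whatever the trained n-gram model requires.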