Japanese Linguistics
Morphological Analysis Strategy
Standard LLM tokenization doesn't align with Japanese linguistic units. The app uses a layered approach:
Current: AI-Powered Analysis (Gemini)
Because the app is Android-only and uses Gemini with a bring-your-own-key (BYOK) setup, morphological analysis is currently delegated to the AI model:
- •The tutor agent system prompt instructs Gemini to provide detailed breakdowns
- •Grammar pattern identification via structured prompts
- •Error detection through conversational context
Future: Native Analyzers
For offline or latency-sensitive analysis:
- •Android: Kuromoji (Java-based, can be bridged via native module)
- •Wrapper: Custom React Native native module exposing tokenize(text) → morpheme array
- •Output format: [{ surface, reading, pos, baseForm }]
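Since data crossing a native bridge arrives untyped, the morpheme array could be checked before the rest of the app relies on it. The following is a minimal sketch, not the app's actual wrapper: the `Morpheme` interface mirrors the output format listed above, while `isMorpheme` and `parseMorphemes` are hypothetical helper names, and all four fields are assumed to be strings.

```typescript
// Shape of one morpheme, mirroring the documented output format.
// Assumption: every field is delivered as a string.
interface Morpheme {
  surface: string;  // text as it appears in the input
  reading: string;  // kana reading
  pos: string;      // part of speech
  baseForm: string; // dictionary form
}

const MORPHEME_KEYS = ["surface", "reading", "pos", "baseForm"] as const;

// Type guard: true only when every expected field is present and a string.
function isMorpheme(value: unknown): value is Morpheme {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return MORPHEME_KEYS.every((key) => typeof record[key] === "string");
}

// Hypothetical helper: validates whatever the native module returned.
function parseMorphemes(raw: unknown): Morpheme[] {
  if (!Array.isArray(raw)) {
    throw new TypeError("tokenize() result must be an array");
  }
  raw.forEach((item, index) => {
    if (!isMorpheme(item)) {
      throw new TypeError(`Invalid morpheme at index ${index}`);
    }
  });
  return raw as Morpheme[];
}
```

Failing fast at the bridge keeps malformed analyzer output from reaching grammar scoring or feedback code.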
Applications
- •Input parsing: Detect verb conjugation forms (て-form, ない-form, past tense)
- •Error feedback: "You used the past-tense suffix on a non-past stem"
- •BKT updates: Automatically score grammar usage from conversation input
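As an illustration of the input-parsing use case, conjugation forms could be detected by looking at each verb morpheme and the morpheme that follows it. This is a rough sketch under assumptions: `pos` is assumed to hold IPADIC-style Japanese tags (動詞, 助詞, 助動詞), as Kuromoji's default dictionary produces, and `detectConjugations` and its result shape are hypothetical names, not part of the app.

```typescript
interface Morpheme {
  surface: string;
  reading: string;
  pos: string;      // assumption: IPADIC-style tag such as "動詞" or "助動詞"
  baseForm: string;
}

type ConjugationForm = "te-form" | "nai-form" | "past";

interface ConjugationHit {
  form: ConjugationForm;
  baseForm: string; // dictionary form of the conjugated verb
  index: number;    // position of the verb morpheme in the input array
}

// Scans adjacent pairs: a verb followed by て/で (particle), ない (auxiliary),
// or た/だ (auxiliary). Forms outside these three patterns are ignored.
function detectConjugations(morphemes: Morpheme[]): ConjugationHit[] {
  const hits: ConjugationHit[] = [];
  for (let i = 0; i < morphemes.length - 1; i++) {
    const verb = morphemes[i];
    const next = morphemes[i + 1];
    if (!verb.pos.startsWith("動詞")) continue;

    let form: ConjugationForm | null = null;
    if (next.pos.startsWith("助詞") && (next.surface === "て" || next.surface === "で")) {
      form = "te-form";
    } else if (next.pos.startsWith("助動詞") && next.baseForm === "ない") {
      form = "nai-form";
    } else if (next.pos.startsWith("助動詞") && (next.surface === "た" || next.surface === "だ")) {
      form = "past";
    }
    if (form !== null) hits.push({ form, baseForm: verb.baseForm, index: i });
  }
  return hits;
}
```

The detected forms could then feed error feedback or grammar scoring; how a hit maps to a BKT update is left open here.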
Pitch Accent Visualization
Pitch accent is critical for natural-sounding Japanese. The app uses react-native-svg for visualization.
Implementation
File concept: src/components/PitchAccentGraph.tsx
```tsx
// SVG line chart over text mora
// Pattern types: Heiban (flat), Atamadaka (head-high),
// Nakadaka (mid-high), Odaka (tail-high)
//
// Data source: pitch patterns stored in curriculum_nodes.content_payload
// as JSON: { "pitch": "LHH", "mora": ["か", "ん", "じ"] }
```
Pitch Pattern Encoding
Store in content_payload JSON of curriculum_nodes:
```json
{
  "word": "漢字",
  "reading": "かんじ",
  "pitch": "LHH",
  "type": "heiban",
  "mora": ["か", "ん", "じ"]
}
```
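A payload in this shape can be sanity-checked before it is stored, since the pitch string, the mora list, and the pattern type have to agree with one another. The sketch below is one possible check, not the app's actual validation: `validatePitchPayload` is a hypothetical name, and the rules assume one L/H letter per mora and the standard Tokyo-accent shapes (atamadaka starts high, nakadaka drops inside the word, heiban and odaka do not drop inside the word).

```typescript
type PitchType = "heiban" | "atamadaka" | "nakadaka" | "odaka";

interface PitchPayload {
  word: string;
  reading: string;
  pitch: string;   // one "L" or "H" per mora (assumption)
  type: PitchType;
  mora: string[];
}

// Returns a list of problems; an empty list means the payload is consistent.
function validatePitchPayload(p: PitchPayload): string[] {
  const errors: string[] = [];
  if (!/^[LH]+$/.test(p.pitch)) errors.push("pitch must contain only L and H");
  if (p.pitch.length !== p.mora.length) errors.push("pitch needs one letter per mora");
  if (p.mora.join("") !== p.reading) errors.push("mora must concatenate to reading");

  const startsHigh = p.pitch.startsWith("H");
  const dropsInside = p.pitch.includes("HL");
  switch (p.type) {
    case "atamadaka":
      if (!startsHigh) errors.push("atamadaka must start high");
      break;
    case "nakadaka":
      if (startsHigh || !dropsInside) errors.push("nakadaka must start low and drop inside the word");
      break;
    case "heiban":
    case "odaka":
      if (dropsInside || (startsHigh && p.pitch.length > 1)) {
        errors.push(`${p.type} must start low and not drop inside the word`);
      }
      break;
  }
  return errors;
}
```

Heiban and odaka look identical within the word itself, which is why the `type` field is stored alongside the pitch string rather than derived from it.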
Rendering Rules
- •L (low): Y position = baseline
- •H (high): Y position = elevated
- •Draw connecting lines between mora positions
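- •Following particle: after heiban (平板) words the particle stays high; after odaka words the pitch drops on the particle; after atamadaka and nakadaka words the drop happens inside the word, so the particle stays low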
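The rendering rules can be sketched as a pure function that turns a pitch string into coordinates, kept separate from the component so it is easy to test. This is an illustrative sketch only: `pitchToPoints`, `withParticle`, `toPolylinePoints`, and the layout option names are hypothetical, and the output string is written in the "x,y x,y" form that SVG polylines use for their `points` attribute.

```typescript
type PitchType = "heiban" | "atamadaka" | "nakadaka" | "odaka";

interface Point { x: number; y: number; }

interface LayoutOptions {
  moraWidth: number; // horizontal space per mora
  baselineY: number; // Y for L (SVG Y grows downward, so this is the larger value)
  elevatedY: number; // Y for H
}

// Appends the pitch of a following particle: high after heiban, low otherwise.
function withParticle(pitch: string, type: PitchType): string {
  return pitch + (type === "heiban" ? "H" : "L");
}

// Maps each L/H letter to a point centered over its mora.
function pitchToPoints(pitch: string, opts: LayoutOptions): Point[] {
  return pitch.split("").map((level, i) => {
    if (level !== "L" && level !== "H") {
      throw new Error(`Unexpected pitch letter "${level}" at position ${i}`);
    }
    return {
      x: i * opts.moraWidth + opts.moraWidth / 2,
      y: level === "H" ? opts.elevatedY : opts.baselineY,
    };
  });
}

// Serializes points for a polyline element; connecting lines come for free.
function toPolylinePoints(points: Point[]): string {
  return points.map((p) => `${p.x},${p.y}`).join(" ");
}
```

For example, `toPolylinePoints(pitchToPoints("LHH", { moraWidth: 40, baselineY: 60, elevatedY: 20 }))` yields `"20,60 60,20 100,20"`, which a component could pass to a polyline element.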
Document Processing Pipeline
For the "Upload Materials" feature in Settings:
Supported Formats
- •PDF (via expo-document-picker + future text extraction)
- •Plain text (.txt)
- •Markdown (.md)
Pipeline Steps
- •Extract text from uploaded file
- •Send to Gemini with structured output prompt:
  - •Extract vocabulary (word, reading, meaning, JLPT level)
  - •Extract grammar points (pattern, usage, examples)
  - •Extract kanji (character, readings, meaning)
  - •Identify prerequisite dependencies
- •Validate AI output against expected JSON schema
- •Insert into curriculum_nodes and node_dependencies via curriculum-service.ts
- •Create flashcards via card-service.createFlashcard() for each extracted item
- •Chunk text into document_chunks table for RAG retrieval
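The validation step matters because model output is not guaranteed to match the requested structure. The following dependency-free sketch shows one way to check it before anything is inserted. The field names (`vocabulary`, `grammar`, `kanji`, `jlptLevel`, and so on) are assumptions derived from the extraction list above, not a confirmed schema; `parseExtraction` is a hypothetical helper; and prerequisite dependencies are left out for brevity.

```typescript
// Assumed shapes, derived from the extraction list in the pipeline steps.
interface VocabularyItem { word: string; reading: string; meaning: string; jlptLevel: string; }
interface GrammarPoint { pattern: string; usage: string; examples: string[]; }
interface KanjiItem { character: string; readings: string[]; meaning: string; }
interface ExtractionResult {
  vocabulary: VocabularyItem[];
  grammar: GrammarPoint[];
  kanji: KanjiItem[];
}

type FieldKind = "string" | "string[]";

function matches(value: unknown, kind: FieldKind): boolean {
  if (kind === "string") return typeof value === "string";
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Checks that `raw` is an array whose items carry every required field.
function checkItems(
  raw: unknown, name: string, fields: Record<string, FieldKind>, errors: string[],
): void {
  if (!Array.isArray(raw)) {
    errors.push(`${name} must be an array`);
    return;
  }
  raw.forEach((item, i) => {
    const record = (typeof item === "object" && item !== null ? item : {}) as Record<string, unknown>;
    for (const field of Object.keys(fields)) {
      if (!matches(record[field], fields[field])) {
        errors.push(`${name}[${i}].${field} must be ${fields[field]}`);
      }
    }
  });
}

// Parses the model's JSON text and throws with all problems listed at once.
function parseExtraction(jsonText: string): ExtractionResult {
  const data = JSON.parse(jsonText) as Record<string, unknown> | null;
  const errors: string[] = [];
  checkItems(data?.vocabulary, "vocabulary",
    { word: "string", reading: "string", meaning: "string", jlptLevel: "string" }, errors);
  checkItems(data?.grammar, "grammar",
    { pattern: "string", usage: "string", examples: "string[]" }, errors);
  checkItems(data?.kanji, "kanji",
    { character: "string", readings: "string[]", meaning: "string" }, errors);
  if (errors.length > 0) throw new Error(`Invalid extraction: ${errors.join("; ")}`);
  return data as unknown as ExtractionResult;
}
```

Collecting every error before throwing makes it easier to log what the model got wrong or to retry with a corrective prompt. A schema library could replace the hand-written checks if the project already depends on one.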
Chunking Strategy
- •Target: ~500 tokens per chunk with 10% overlap
- •Preserve sentence boundaries (split on 。 or \n)
- •Store chunk_index and page_number for source attribution
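The chunking strategy can be sketched as a sentence-aware accumulator: sentences are added until the token budget would be exceeded, then the chunk is closed and its trailing sentences are carried over as overlap. This is a minimal sketch under stated assumptions: `estimateTokens` crudely counts one token per character, which is only an approximation of how a real tokenizer treats Japanese text; `chunkText` and `splitSentences` are hypothetical names; and page numbers are omitted because they depend on the text extractor.

```typescript
interface Chunk {
  chunkIndex: number;
  text: string;
}

// Assumption: roughly one token per character. Swap in a real tokenizer if available.
function estimateTokens(text: string): number {
  return text.length;
}

// Splits on 。 (kept attached to its sentence) and on newlines.
function splitSentences(text: string): string[] {
  const parts = text.match(/[^。\n]+。?/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

function chunkText(text: string, maxTokens = 500, overlapRatio = 0.1): Chunk[] {
  const overlapBudget = Math.floor(maxTokens * overlapRatio);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of splitSentences(text)) {
    const tokens = estimateTokens(sentence);
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      chunks.push({ chunkIndex: chunks.length, text: current.join("") });
      // Carry whole trailing sentences into the next chunk, up to the overlap budget.
      const carried: string[] = [];
      let carriedTokens = 0;
      for (let i = current.length - 1; i >= 0; i--) {
        const t = estimateTokens(current[i]);
        if (carriedTokens + t > overlapBudget) break;
        carried.unshift(current[i]);
        carriedTokens += t;
      }
      current = carried;
      currentTokens = carriedTokens;
    }
    // A single sentence longer than maxTokens becomes its own oversized chunk.
    current.push(sentence);
    currentTokens += tokens;
  }
  if (current.length > 0) {
    chunks.push({ chunkIndex: chunks.length, text: current.join("") });
  }
  return chunks;
}
```

Because overlap is carried in whole sentences, the actual overlap can be smaller than 10% (or zero when the last sentence alone exceeds the budget); that trade-off keeps every chunk aligned to sentence boundaries.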
Furigana Conventions
The tutor agent follows these rules in all responses:
- •New kanji words: always show furigana, e.g. 漢字(かんじ)
- •Known kanji (high BKT mastery): optionally omit furigana
- •Grammar particle explanations: highlight with bold
- •Example sentences: full Japanese + English translation on next line
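Although these rules are enforced through the tutor prompt, the same furigana convention could also be applied in client code, for instance when rendering vocabulary lists. The sketch below is hypothetical: `formatWithFurigana` is not an existing function, the 0.9 mastery threshold is an assumed value rather than one taken from the app, and furigana is always dropped above the threshold even though the convention only makes that optional.

```typescript
// Assumption: BKT mastery is a probability in [0, 1]; 0.9 counts as "known".
const MASTERY_THRESHOLD = 0.9;

// Appends the reading in parentheses unless the word has no kanji
// or the learner's mastery is at or above the threshold.
function formatWithFurigana(
  word: string,
  reading: string,
  mastery: number,
  threshold: number = MASTERY_THRESHOLD,
): string {
  const hasKanji = /[\u4E00-\u9FFF]/.test(word); // CJK Unified Ideographs block
  if (!hasKanji || mastery >= threshold) return word;
  return `${word}(${reading})`;
}
```

With these assumptions, `formatWithFurigana("漢字", "かんじ", 0.2)` returns `漢字(かんじ)`, while a mastery of `0.95` returns the bare word.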