When to Use
- •Adding new features to voice or vision modules
- •Understanding how modules interact (voice ↔ vision)
- •Creating new adapters or use cases
- •Working with domain entities or services
- •Troubleshooting dependency issues
- •Adding UI components following Atomic Design
Project Structure
code
mobile/src/ ├── voice/ # Voice recognition & commands module ├── vision/ # Camera & AI vision module └── shared/ # Reusable components & utilities
Each module follows Clean Architecture with 4 layers:
- •Domain: Pure business logic (entities, services, value objects)
- •Application: Use cases + ports (interfaces)
- •Infrastructure: External implementations (adapters)
- •Presentation: React components, hooks, XState machines
Voice Module
Structure
code
voice/
├── domain/
│ ├── entities/VoiceCommand.ts # Voice command entity
│ ├── services/WakeWordParser.ts # Parse wake word
│ └── value-objects/CommandIntent.ts # Intent classification
├── application/
│ ├── use-cases/ProcessCommand.ts # Execute commands
│ └── ports/
│ ├── SpeechSynthesizer.ts # TTS interface
│ ├── VisionService.ts # Vision module interface
│ └── DescriptionRepository.ts # Storage interface
├── infrastructure/
│ ├── adapters/
│ │ ├── expo/ExpoSpeechRecognitionAdapter.ts
│ │ ├── cloud/ReactNativeVoiceAdapter.ts
│ │ ├── whisper/WhisperAdapter.ts
│ │ ├── mock/MockVoiceAdapter.ts
│ │ └── simple/SimpleSpeechAdapter.ts
│ └── services/
│ ├── ContinuousWakeWordService.ts
│ └── PorcupineWakeWordService.ts
├── machines/
│ └── voiceMachine.ts # XState state machine
└── presentation/
└── hooks/
├── useVoiceCommands.ts # Main orchestrator
├── useVoiceRecognition.ts
└── useContinuousWakeWord.ts
Key Files
| File | Purpose |
|---|---|
ProcessCommand.ts | Main use case: executes DESCRIBE, REPEAT, HELP, STOP, GOODBYE |
CommandIntent.ts | Classifies intent with bilingual regex patterns |
voiceMachine.ts | XState machine: idle → listening → processing → speaking |
useVoiceCommands.ts | Main hook: integrates machine + recognition + use case |
Command Intents
typescript
DESCRIBE // "describe", "what do you see", "qué ves" REPEAT // "repeat", "again", "repite" HELP // "help", "ayuda" STOP // "stop", "para" GOODBYE // "goodbye", "adiós" UNKNOWN // Fallback
Vision Module
Structure
code
vision/
├── domain/
│ ├── entities/
│ │ ├── DetectedObject.ts # Object with bbox, position, size
│ │ └── SceneDescription.ts # Scene with objects & description
│ ├── services/
│ │ └── SceneDescriptionGenerator.ts # Natural language generator
│ └── value-objects/
│ └── LabelTranslations.ts # COCO labels EN→ES
├── application/
│ ├── use-cases/
│ │ └── AnalyzeSceneUseCase.ts # Capture + analyze
│ └── ports/
│ ├── IVisionService.ts # Vision AI interface
│ └── ICameraService.ts # Camera interface
├── infrastructure/
│ └── adapters/
│ ├── tflite/TFLiteVisionAdapter.ts # TensorFlow Lite
│ ├── expo/ExpoCameraAdapter.ts # Expo Camera
│ └── voice/VisionServiceBridge.ts # Bridge to voice module
└── presentation/
├── components/
│ └── CameraCapture.tsx # Hidden camera (1x1px)
└── hooks/
└── useVisionService.ts # Create vision service
Key Files
| File | Purpose |
|---|---|
AnalyzeSceneUseCase.ts | Orchestrates: permissions → capture → analyze → description |
SceneDescriptionGenerator.ts | Converts detections to Spanish natural language |
TFLiteVisionAdapter.ts | COCO-SSD MobileNet V1 (80 object categories) |
VisionServiceBridge.ts | Connects vision to voice module |
LabelTranslations.ts | Translates COCO labels (person → persona, chair → silla) |
Object Detection
TensorFlow Lite COCO-SSD model detects:
- •80 object categories (person, car, chair, laptop, etc.)
- •Bounding boxes with confidence scores
- •Position classification (left/center/right, top/middle/bottom)
- •Size classification (small/medium/large)
Shared Module
Structure
code
shared/
└── presentation/
└── components/
├── atoms/ # Basic components
│ ├── Button.tsx
│ ├── Icon.tsx
│ ├── Typography.tsx
│ ├── PulsingCircle.tsx
│ └── WakeWordStatusBar.tsx
├── molecules/ # Composite components
│ └── VoiceCommandPanel.tsx # Main UI control
└── pages/
└── HomeScreen.tsx # Main app screen
Follows Atomic Design:
- •Atoms: Button, Icon, Typography, PulsingCircle
- •Molecules: VoiceCommandPanel
- •Pages: HomeScreen
Module Communication
Voice → Vision Dependency
code
Voice (ProcessCommandUseCase)
→ VisionService port (interface)
→ VisionServiceBridge (infrastructure)
→ AnalyzeSceneUseCase (vision module)
→ TFLiteVisionAdapter + ExpoCameraAdapter
Why?
- •Voice module doesn't know about TFLite or cameras
- •Vision module is replaceable (GPT-4 Vision, Gemini, etc.)
- •Bridge pattern maintains independence
Naming Conventions
| Type | Pattern | Example |
|---|---|---|
| Entity | Noun | VoiceCommand, DetectedObject |
| Value Object | Descriptive noun | CommandIntent, LabelTranslations |
| Service | Verb + noun | WakeWordParser, SceneDescriptionGenerator |
| Use Case | Verb + noun + UseCase | AnalyzeSceneUseCase, ProcessCommandUseCase |
| Port | I + PascalCase | IVisionService, ICameraService |
| Adapter | Tech + purpose + Adapter | TFLiteVisionAdapter, ExpoCameraAdapter |
| Hook | use + PascalCase | useVoiceCommands, useVisionService |
Critical Patterns
1. Dependency Rule
- •Outer layers depend on inner layers
- •Domain → Application → Infrastructure → Presentation
- •Never reverse (infrastructure never imports presentation)
2. Ports & Adapters
- •Application layer defines ports (interfaces)
- •Infrastructure layer provides adapters (implementations)
- •Multiple adapters per port (5 speech recognition options!)
Example:
typescript
// Application layer (port)
interface IVisionService {
analyzeImage(uri: string): Promise<SceneDescription>;
preloadModels(): Promise<void>;
isReady(): boolean;
}
// Infrastructure layer (adapter)
class TFLiteVisionAdapter implements IVisionService {
async analyzeImage(uri: string) { /* TFLite implementation */ }
async preloadModels() { /* Load COCO-SSD */ }
isReady() { /* Check model loaded */ }
}
3. Use Cases
- •One use case = one application feature
- •Orchestrates domain services and entities
- •Returns domain entities or DTOs
Example:
typescript
class AnalyzeSceneUseCase {
constructor(
private cameraService: ICameraService,
private visionService: IVisionService
) {}
async execute(): Promise<SceneDescription> {
const photo = await this.cameraService.capturePhoto();
return this.visionService.analyzeImage(photo.uri);
}
}
4. State Machines (XState)
Voice flow managed by voiceMachine:
code
idle → listening → processing → speaking → idle ↓ ↓ ↓ error ←──────────────┘ └→ retry
Guards: hasWakeWord, isHighConfidence, isSuccess
5. Bridge Pattern
VisionServiceBridge connects modules:
- •Implements voice's
VisionServiceport - •Uses vision's
AnalyzeSceneUseCaseinternally - •Maintains module independence
Adding New Features
New Voice Adapter
- •Create
infrastructure/adapters/<tech>/<Tech>Adapter.ts - •Implement
SpeechSynthesizerport - •Export from
infrastructure/adapters/index.ts - •Use in
useVoiceCommandsoruseVoiceRecognition
New Vision Adapter
- •Create
infrastructure/adapters/<tech>/<Tech>VisionAdapter.ts - •Implement
IVisionServiceport - •Swap in
useVisionServicehook orVisionServiceBridge
New Command Intent
- •Add pattern to
CommandIntent.ts - •Handle in
ProcessCommandUseCase.execute() - •Add tests in
ProcessCommand.test.ts
New UI Component
- •Determine atomic level (atom/molecule/organism/page)
- •Create in
shared/presentation/components/<level>/<Name>.tsx - •Add tests in
__tests__/<Name>.test.tsx - •Export from index if needed
Data Flow Example
User says "Iris, describe":
code
1. useVoiceCommands → ExpoSpeechRecognitionAdapter
Transcript: "Iris, describe"
2. WakeWordParser.parse("Iris, describe", 0.95)
→ ParsedCommand { intent: DESCRIBE, commandText: "describe" }
3. ProcessCommandUseCase.execute(parsedCommand)
→ visionService.analyzeScene()
4. VisionServiceBridge → AnalyzeSceneUseCase.execute()
→ cameraService.capturePhoto()
→ visionService.analyzeImage(photo)
5. TFLiteVisionAdapter runs COCO-SSD
→ Detections: [person: 0.92, chair: 0.85, laptop: 0.78]
6. SceneDescriptionGenerator.generate(detections)
→ "Veo una persona, una silla y un portátil"
7. speechSynthesizer.speak("Veo una persona...")
repository.saveLastDescription()
8. voiceMachine: processing → speaking → listening
VoiceCommandPanel updates UI
Technology Stack
| Layer | Voice | Vision |
|---|---|---|
| Domain | Pure JS/TS | Pure JS/TS |
| Application | Pure JS/TS | Pure JS/TS |
| Infrastructure | Expo Speech, Porcupine, Whisper | TensorFlow Lite, Expo Camera |
| Presentation | React, XState | React |
Testing Strategy
- •Domain: Unit tests (entities, services, value objects)
- •Application: Use case tests with mocked ports
- •Presentation: Component/hook tests with Testing Library
- •State Machines: XState machine tests
Example test locations:
- •
voice/domain/services/__tests__/WakeWordParser.test.ts - •
voice/application/use-cases/__tests__/ProcessCommand.test.ts - •
shared/presentation/components/atoms/__tests__/Button.test.tsx
Common Gotchas
- •
Don't import presentation from infrastructure
- •❌
import { useVisionService } from '../presentation' - •✅ Use dependency injection or hooks
- •❌
- •
Don't put business logic in hooks
- •❌ Intent classification in
useVoiceCommands - •✅ Use domain services (
WakeWordParser,CommandIntent)
- •❌ Intent classification in
- •
Don't skip the bridge
- •❌ Voice directly imports vision's
AnalyzeSceneUseCase - •✅ Use
VisionServiceBridgeto maintain independence
- •❌ Voice directly imports vision's
- •
Don't hardcode strings
- •❌
speak("I see a person") - •✅ Use
SceneDescriptionGeneratoror translation files
- •❌
Resources
- •Architecture Diagram: See assets/architecture-diagram.md
- •Feature Checklist: See assets/new-feature-checklist.md