separate-audio
Isolate specific sounds from audio using natural language text prompts. Uses Meta's SAM-Audio model via mlx-audio for native Mac M2/M3 inference.
Capabilities
- •Text prompts: Describe what to extract ("man speaking", "piano", "applause")
- •Time span hints: Specify when target sound occurs for better isolation
- •Source separation: Get both the target sound and the residual (everything else)
Usage
bash
# Extract speaker by description python skills/separate-audio/separate.py panel.wav --prompt "man speaking" --output speaker.wav # Extract with time hint python skills/separate-audio/separate.py video.mp4 --prompt "applause" --span 10.5-12.0 # Save both target and residual python skills/separate-audio/separate.py audio.wav --prompt "woman singing" --save-residual
Use Cases for Video Production
| Use Case | Prompt Example |
|---|---|
| Extract single speaker | "man speaking about investments" |
| Remove background music | Separate, keep residual |
| Isolate applause | "audience applause" |
| Clean panel discussion | Run multiple times with different prompts |
Programmatic Usage
python
from separate import separate_audio
result = separate_audio(
audio_path="panel.wav",
prompt="man speaking about space",
output_path="speaker.wav",
span=(10.5, 12.0), # Optional time hint
)
print(result["target_path"])
Notes
- •Requires mlx-audio:
pip install mlx-audio - •Runs natively on Mac M2/M3 via MLX framework
- •First run downloads SAM-Audio model (~2GB)
- •Works best with clear, specific descriptions
- •Time spans help isolate sounds at specific moments
Status
This skill is implemented but not extensively tested in the main video pipeline. The primary audio workflow uses Qwen3-ForcedAligner for caption alignment. SAM-Audio is available for advanced use cases like:
- •Cleaning up panel discussion audio
- •Extracting speaker voices for analysis
- •Separating background noise from speech