Nightingale Code Analysis: Architecture of a Rust + ML Open-Source Karaoke App
Tauri 2 + Python ML server + UVR/Demucs vocal separation + WhisperX lyrics sync + real-time pitch detection
Direct analysis of the GitHub repository (rzru/nightingale).
Architecture: Not Bevy, It's Tauri 2
Though initially reported as a "Rust + Bevy game engine" project, the code is actually a Tauri 2 desktop app (Rust backend + React frontend).
Cargo workspace:
app-core/ - pure Rust library: business logic, ML orchestration, caching, vendor bootstrapping
client/src-tauri/ - Tauri 2 shell: exposes app-core as #[tauri::command], mic input via cpal, pitch detection via the pitch-detection crate
xtask/ - build tooling
The "game engine" aspect is limited to GPU shader backgrounds (Plasma, Aurora, Nebula) rendered via WebGL/WebGPU in the React frontend.
Vocal Separation: The Core
Two separation backends, selectable via config.separator() (default: "karaoke").
1. UVR Karaoke Model (Default)
Implemented in app-core/analyzer/stems.py:
KARAOKE_MODEL = "mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt"
separator = Separator(model_file_dir=models_dir, output_dir=work_dir)
separator.load_model(KARAOKE_MODEL)
output_files = separator.separate(audio_path)
Mel-Band RoFormer architecture, trained by aufr33 + viperx, SDR 10.1956. Key characteristic: it isolates the lead vocal while keeping backing vocals in the instrumental, perfect for karaoke.
Runs via audio-separator package on ONNX Runtime. CUDA on NVIDIA, CoreML on Apple Silicon.
2. Demucs (Alternative)
Facebook's htdemucs model directly:
model = get_model("htdemucs")
sources = apply_model(model, wav_scaled[None], device=device, shifts=1, overlap=0.25)[0]
vocals = sources[vocals_idx]
instrumental = wav.to(device) - vocals
Demucs separates audio into 4 stems, but Nightingale extracts only the vocals, computing the instrumental as original - vocals.
How Rust Manages Python ML
The cleverest architectural decision: 10 Python scripts are embedded into the Rust binary at compile time:
const STEMS_PY: &str = include_str!("../analyzer/stems.py");
Extracted to ~/.nightingale/vendor/analyzer/ on first run. Rust spawns a persistent Python server process communicating via stdin/stdout JSON protocol.
The server stays alive between songs so the WhisperX model remains loaded. On CUDA OOM, the server is killed and respawned with clean GPU state.
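The stdin/stdout protocol can be sketched as a line-delimited JSON loop on the Python side. This is a minimal sketch, assuming one JSON object per line in each direction; the command names ("ping", "analyze") are illustrative, not Nightingale's actual protocol:

```python
# Minimal line-delimited JSON request/response server, one JSON object
# per line on stdin, one JSON reply per line on stdout.
import json
import sys


def handle_request(req: dict) -> dict:
    """Dispatch one JSON request; errors become JSON replies, not crashes."""
    cmd = req.get("cmd")
    if cmd == "ping":
        return {"ok": True, "cmd": "ping"}
    if cmd == "analyze":
        # A real server would run the ML pipeline here, keeping models warm
        # between requests so WhisperX is not reloaded per song.
        return {"ok": True, "result": {"path": req.get("path")}}
    return {"ok": False, "error": f"unknown command: {cmd!r}"}


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read requests line by line until the parent (Rust) closes stdin."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        reply = handle_request(json.loads(line))
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the Rust side blocks on the reply, so flush eagerly


if __name__ == "__main__":
    serve()
```

Because the loop exits when stdin closes, the Rust side can kill and respawn the process (as on CUDA OOM) without any special shutdown handshake.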
Processing Pipeline
pipeline.py::run_pipeline() orchestrates:
1. Key Detection - FFT-based Krumhansl-Schmuckler algorithm
2. Stem Separation - UVR Karaoke or Demucs
3. Lyrics - LRCLIB first, WhisperX transcription as fallback
4. Cache - blake3 hash-based, no re-analysis for unchanged files
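Step 1's key detection can be sketched with the classic Krumhansl-Schmuckler procedure: build a 12-bin pitch-class profile from FFT magnitudes, then Pearson-correlate it against the major and minor key profiles rotated through all 12 keys. The profile weights below are the published Krumhansl-Kessler values; the rest is a minimal sketch, not Nightingale's implementation:

```python
# Krumhansl-Schmuckler key finding: correlate a 12-bin pitch-class profile
# (C..B) against rotated major/minor key profiles (Krumhansl-Kessler weights).
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def detect_key(chroma):
    """chroma: 12 pitch-class energies (C..B), e.g. summed from FFT bins."""
    best = None
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for key in range(12):
            # Rotate so that profile[0] (the tonic weight) lands on pitch class `key`.
            rotated = [profile[(i - key) % 12] for i in range(12)]
            r = pearson(chroma, rotated)
            if best is None or r > best[0]:
                best = (r, NAMES[key], mode)
    return best[1], best[2]
```

Feeding it a profile dominated by C, E, G yields C major; one dominated by A, C, E yields A minor.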
Lyrics Sync
LRCLIB first: searches lrclib.net API, ranks by album match + duration proximity.
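A ranking like that might look as follows. The field names are modeled on LRCLIB's search response; the exact scoring function is a guess, not the repo's code:

```python
def rank_lyrics_candidates(candidates, album, duration_s):
    """Sort LRCLIB search hits: exact album matches first, then closest duration.

    candidates: list of dicts with (assumed) "albumName" and "duration" keys.
    """
    def score(c):
        album_match = (c.get("albumName") or "").strip().lower() == album.strip().lower()
        duration_gap = abs((c.get("duration") or 0) - duration_s)
        # Tuples sort lexicographically: False (a match) sorts before True,
        # then smaller duration gaps win.
        return (not album_match, duration_gap)

    return sorted(candidates, key=score)
```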
WhisperX transcription (when no lyrics found):
1. Vocal region detection via RMS energy
2. 80Hz highpass + RMS normalization
3. Multi-window language detection (4 × 30s windows, voting)
4. Transcription + forced alignment for word-level timestamps
5. Dropped word recovery + timestamp interpolation
6. Display segment grouping (max 10 words/line)
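Step 6's grouping can be sketched as a single pass over word timestamps: start a new display line at the word cap, or on a long pause. The 10-word cap comes from the article; the pause threshold is an assumption:

```python
def group_display_lines(words, max_words=10, pause_s=2.0):
    """Group timed words into display lines.

    words: list of (text, start_s, end_s) tuples in time order.
    Starts a new line when the current line hits max_words (per the article)
    or after a long silence (pause_s is an assumed threshold).
    """
    lines = []
    for word in words:
        _text, start, _end = word
        new_line = (
            not lines
            or len(lines[-1]) >= max_words
            or start - lines[-1][-1][2] > pause_s
        )
        if new_line:
            lines.append([])
        lines[-1].append(word)
    return lines
```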
Hallucination filtering removes known Whisper artifacts ("please subscribe", etc.).
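A minimal version of such a filter normalizes each segment and checks it against a blocklist. The entries besides "please subscribe" are commonly reported Whisper artifacts, not necessarily Nightingale's exact list:

```python
import re

# Phrases Whisper is known to hallucinate on silence or pure music;
# illustrative blocklist, not the repo's actual one.
HALLUCINATIONS = {
    "please subscribe",
    "thanks for watching",
    "subtitles by the amaraorg community",
}


def is_hallucination(segment_text: str) -> bool:
    """Strip punctuation, collapse case, and match against the blocklist."""
    normalized = re.sub(r"[^\w\s]", "", segment_text).strip().lower()
    return normalized in HALLUCINATIONS


def filter_segments(segments):
    """Drop transcript segments whose text is a known artifact."""
    return [s for s in segments if not is_hallucination(s["text"])]
```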
OOM fallback: GPU attempt → free model and retry on GPU → CPU fallback.
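That fallback chain reads as a three-attempt ladder. Sketched generically below; the exception type and the free_gpu hook are stand-ins, not the repo's names:

```python
def run_with_oom_fallback(job, free_gpu, oom_exc=MemoryError):
    """Run job("cuda"); on OOM, free GPU state and retry once; then job("cpu").

    job: callable taking a device string and returning a result.
    free_gpu: callable that e.g. drops the cached model and empties the CUDA cache.
    oom_exc: the out-of-memory exception type to catch (stand-in: MemoryError).
    """
    try:
        return job("cuda")
    except oom_exc:
        free_gpu()  # reclaim GPU memory before the second attempt
    try:
        return job("cuda")
    except oom_exc:
        return job("cpu")  # last resort: slow but reliable
```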
Real-Time Pitch Detection
Pure Rust, no Python. client/src-tauri/src/microphones.rs:
cpal for cross-platform mic input
McLeod Pitch Detection (MPM) algorithm
2048-sample window, 80-1000Hz range
~40Hz event emission to React frontend via Tauri events
Frontend compares detected vs expected pitch for scoring
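The detected-vs-expected comparison is naturally done in cents (1/100 of a semitone, so 1200 cents per octave). A sketch of that math, in Python for consistency with the other examples; the 50-cent tolerance (roughly a quarter tone) is an assumption, not the repo's value:

```python
import math


def cents_off(detected_hz: float, expected_hz: float) -> float:
    """Signed pitch distance in cents; positive means the singer is sharp."""
    return 1200.0 * math.log2(detected_hz / expected_hz)


def is_hit(detected_hz: float, expected_hz: float,
           tolerance_cents: float = 50.0) -> bool:
    """Score a frame as on-pitch if within the tolerance window."""
    return abs(cents_off(detected_hz, expected_hz)) <= tolerance_cents


# 440 Hz vs 220 Hz is exactly one octave sharp: 1200 cents.
```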
Self-Contained Binary Bootstrap
6 steps on first launch: download FFmpeg → download uv → install Python 3.10 → create venv → install ML packages → extract embedded scripts.
GPU detection: macOS ARM → MPS, NVIDIA → CUDA (cu126/cu128 chosen by compute capability), otherwise CPU.
CJK Lyrics Limitation
WhisperX word alignment uses line_text.split(), which fails for Japanese (no spaces between words). Korean fares better thanks to its spacing, but sync accuracy varies by language.
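The failure mode is easy to demonstrate: Python's str.split() splits on whitespace, so an unspaced Japanese line collapses into a single token, while an equivalently structured spaced Korean line keeps usable word boundaries (the sample sentences are illustrative):

```python
ja = "今日はいい天気ですね"  # Japanese: no inter-word spaces
ko = "오늘 날씨 좋네요"      # Korean: words separated by spaces

print(len(ja.split()))  # 1 -> the whole line becomes one "word"; alignment collapses
print(len(ko.split()))  # 3 -> three alignable word tokens
```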
How It Works
Embed Python scripts into Rust binary at compile time via include_str!()
Control persistent Python server process via stdin/stdout JSON protocol
UVR Mel-Band RoFormer isolates lead vocals only (backing vocals stay in instrumental)
LRCLIB API for existing lyrics → WhisperX transcription + forced alignment if none
McLeod Pitch Detection in pure Rust at ~40Hz, sent to frontend via Tauri events
Auto-bootstrap Python 3.10 + PyTorch + ML models via uv on first launch