Nightingale Code Analysis: Architecture of a Rust + ML Open-Source Karaoke App
Tauri 2 + Python ML server + UVR/Demucs vocal separation + WhisperX lyrics sync + real-time pitch detection
Direct analysis of the GitHub repository (rzru/nightingale).
Architecture: Not Bevy, It's Tauri 2
Though initially reported as a "Rust + Bevy game engine" project, the code is actually a Tauri 2 desktop app (Rust backend + React frontend).
Cargo workspace:
app-core/ - pure Rust library: business logic, ML orchestration, caching, vendor bootstrapping
client/src-tauri/ - Tauri 2 shell: exposes app-core as #[tauri::command], mic input via cpal, pitch detection via the pitch-detection crate
xtask/ - build tooling
The "game engine" aspect is limited to GPU shader backgrounds (Plasma, Aurora, Nebula) rendered via WebGL/WebGPU in the React frontend.
Vocal Separation: The Core
Two separation backends, selectable via config.separator() (default: "karaoke").
1. UVR Karaoke Model (Default)
Implemented in app-core/analyzer/stems.py:
KARAOKE_MODEL = "mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt"
separator = Separator(model_file_dir=models_dir, output_dir=work_dir)
separator.load_model(KARAOKE_MODEL)
output_files = separator.separate(audio_path)
Mel-Band RoFormer architecture, trained by aufr33 + viperx, SDR 10.1956. Key characteristic: it isolates the lead vocal while keeping backing vocals in the instrumental, perfect for karaoke.
Runs via audio-separator package on ONNX Runtime. CUDA on NVIDIA, CoreML on Apple Silicon.
2. Demucs (Alternative)
Facebook's htdemucs model directly:
model = get_model("htdemucs")
sources = apply_model(model, wav_scaled[None], device=device, shifts=1, overlap=0.25)[0]
vocals = sources[vocals_idx]
instrumental = wav.to(device) - vocals
Demucs separates audio into 4 stems, but Nightingale extracts only the vocals, computing the instrumental as original - vocals.
How Rust Manages Python ML
The cleverest architectural decision: 10 Python scripts are embedded into the Rust binary at compile time:
const STEMS_PY: &str = include_str!("../analyzer/stems.py");
Extracted to ~/.nightingale/vendor/analyzer/ on first run. Rust spawns a persistent Python server process communicating via stdin/stdout JSON protocol.
The server stays alive between songs so the WhisperX model remains loaded. On CUDA OOM, the server is killed and respawned with clean GPU state.
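The stdin/stdout protocol can be sketched as a line-delimited JSON loop on the Python side. This is a minimal sketch, assuming one JSON object per line in each direction; the command names ("ping", "analyze") are illustrative, not Nightingale's actual protocol:

```python
# Minimal line-delimited JSON request/response server, one JSON object
# per line on stdin, one JSON reply per line on stdout.
import json
import sys


def handle_request(req: dict) -> dict:
    """Dispatch one JSON request; errors become JSON replies, not crashes."""
    cmd = req.get("cmd")
    if cmd == "ping":
        return {"ok": True, "cmd": "ping"}
    if cmd == "analyze":
        # A real server would run the ML pipeline here, keeping models warm
        # between requests so WhisperX is not reloaded per song.
        return {"ok": True, "result": {"path": req.get("path")}}
    return {"ok": False, "error": f"unknown command: {cmd!r}"}


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read requests line by line until the parent (Rust) closes stdin."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        reply = handle_request(json.loads(line))
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the Rust side blocks on the reply, so flush eagerly


if __name__ == "__main__":
    serve()
```

Because the loop exits when stdin closes, the Rust side can kill and respawn the process (as on CUDA OOM) without any special shutdown handshake.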
Processing Pipeline
pipeline.py::run_pipeline() orchestrates:
1. Key Detection - FFT-based Krumhansl-Schmuckler algorithm
2. Stem Separation - UVR Karaoke or Demucs
3. Lyrics - LRCLIB first, WhisperX transcription as fallback
4. Cache - blake3 hash-based, no re-analysis for unchanged files
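Step 1's key detection can be sketched with the classic Krumhansl-Schmuckler procedure: build a 12-bin pitch-class profile from FFT magnitudes, then Pearson-correlate it against the major and minor key profiles rotated through all 12 keys. The profile weights below are the published Krumhansl-Kessler values; the rest is a minimal sketch, not Nightingale's implementation:

```python
# Krumhansl-Schmuckler key finding: correlate a 12-bin pitch-class profile
# (C..B) against rotated major/minor key profiles (Krumhansl-Kessler weights).
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def detect_key(chroma):
    """chroma: 12 pitch-class energies (C..B), e.g. summed from FFT bins."""
    best = None
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for key in range(12):
            # Rotate so that profile[0] (the tonic weight) lands on pitch class `key`.
            rotated = [profile[(i - key) % 12] for i in range(12)]
            r = pearson(chroma, rotated)
            if best is None or r > best[0]:
                best = (r, NAMES[key], mode)
    return best[1], best[2]
```

Feeding it a profile dominated by C, E, G yields C major; one dominated by A, C, E yields A minor.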
Lyrics Sync
LRCLIB first: searches lrclib.net API, ranks by album match + duration proximity.
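A ranking like that might look as follows. The field names are modeled on LRCLIB's search response; the exact scoring function is a guess, not the repo's code:

```python
def rank_lyrics_candidates(candidates, album, duration_s):
    """Sort LRCLIB search hits: exact album matches first, then closest duration.

    candidates: list of dicts with (assumed) "albumName" and "duration" keys.
    """
    def score(c):
        album_match = (c.get("albumName") or "").strip().lower() == album.strip().lower()
        duration_gap = abs((c.get("duration") or 0) - duration_s)
        # Tuples sort lexicographically: False (a match) sorts before True,
        # then smaller duration gaps win.
        return (not album_match, duration_gap)

    return sorted(candidates, key=score)
```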
WhisperX transcription (when no lyrics found):
1. Vocal region detection via RMS energy
2. 80Hz highpass + RMS normalization
3. Multi-window language detection (4 × 30s windows, voting)
4. Transcription + forced alignment for word-level timestamps
5. Dropped word recovery + timestamp interpolation
6. Display segment grouping (max 10 words/line)
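Step 6's grouping can be sketched as a single pass over word timestamps: start a new display line at the word cap, or on a long pause. The 10-word cap comes from the article; the pause threshold is an assumption:

```python
def group_display_lines(words, max_words=10, pause_s=2.0):
    """Group timed words into display lines.

    words: list of (text, start_s, end_s) tuples in time order.
    Starts a new line when the current line hits max_words (per the article)
    or after a long silence (pause_s is an assumed threshold).
    """
    lines = []
    for word in words:
        _text, start, _end = word
        new_line = (
            not lines
            or len(lines[-1]) >= max_words
            or start - lines[-1][-1][2] > pause_s
        )
        if new_line:
            lines.append([])
        lines[-1].append(word)
    return lines
```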
Hallucination filtering removes known Whisper artifacts ("please subscribe", etc.).
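A minimal version of such a filter normalizes each segment and checks it against a blocklist. The entries besides "please subscribe" are commonly reported Whisper artifacts, not necessarily Nightingale's exact list:

```python
import re

# Phrases Whisper is known to hallucinate on silence or pure music;
# illustrative blocklist, not the repo's actual one.
HALLUCINATIONS = {
    "please subscribe",
    "thanks for watching",
    "subtitles by the amaraorg community",
}


def is_hallucination(segment_text: str) -> bool:
    """Strip punctuation, collapse case, and match against the blocklist."""
    normalized = re.sub(r"[^\w\s]", "", segment_text).strip().lower()
    return normalized in HALLUCINATIONS


def filter_segments(segments):
    """Drop transcript segments whose text is a known artifact."""
    return [s for s in segments if not is_hallucination(s["text"])]
```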
OOM fallback: GPU attempt → free model and retry on GPU → CPU fallback.
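That fallback chain reads as a three-attempt ladder. Sketched generically below; the exception type and the free_gpu hook are stand-ins, not the repo's names:

```python
def run_with_oom_fallback(job, free_gpu, oom_exc=MemoryError):
    """Run job("cuda"); on OOM, free GPU state and retry once; then job("cpu").

    job: callable taking a device string and returning a result.
    free_gpu: callable that e.g. drops the cached model and empties the CUDA cache.
    oom_exc: the out-of-memory exception type to catch (stand-in: MemoryError).
    """
    try:
        return job("cuda")
    except oom_exc:
        free_gpu()  # reclaim GPU memory before the second attempt
    try:
        return job("cuda")
    except oom_exc:
        return job("cpu")  # last resort: slow but reliable
```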
Real-Time Pitch Detection
Pure Rust, no Python. client/src-tauri/src/microphones.rs:
cpal for cross-platform mic input
McLeod Pitch Detection (MPM) algorithm
2048-sample window, 80-1000Hz range
~40Hz event emission to React frontend via Tauri events
Frontend compares detected vs expected pitch for scoring
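The detected-vs-expected comparison is naturally done in cents (1/100 of a semitone, so 1200 cents per octave). A sketch of that math, in Python for consistency with the other examples; the 50-cent tolerance (roughly a quarter tone) is an assumption, not the repo's value:

```python
import math


def cents_off(detected_hz: float, expected_hz: float) -> float:
    """Signed pitch distance in cents; positive means the singer is sharp."""
    return 1200.0 * math.log2(detected_hz / expected_hz)


def is_hit(detected_hz: float, expected_hz: float,
           tolerance_cents: float = 50.0) -> bool:
    """Score a frame as on-pitch if within the tolerance window."""
    return abs(cents_off(detected_hz, expected_hz)) <= tolerance_cents


# 440 Hz vs 220 Hz is exactly one octave sharp: 1200 cents.
```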
Self-Contained Binary Bootstrap
6 steps on first launch: download FFmpeg → download uv → install Python 3.10 → create venv → install ML packages → extract embedded scripts.
GPU detection: macOS ARM → MPS, NVIDIA → CUDA (cu126/cu128 chosen by compute capability), otherwise CPU.
CJK Lyrics Limitation
WhisperX word alignment uses line_text.split(), which fails for Japanese (no spaces between words). Korean fares better thanks to its spacing, but sync accuracy varies by language.
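The failure mode is easy to demonstrate: Python's str.split() splits on whitespace, so an unspaced Japanese line collapses into a single token, while an equivalently structured spaced Korean line keeps usable word boundaries (the sample sentences are illustrative):

```python
ja = "今日はいい天気ですね"  # Japanese: no inter-word spaces
ko = "오늘 날씨 좋네요"      # Korean: words separated by spaces

print(len(ja.split()))  # 1 -> the whole line becomes one "word"; alignment collapses
print(len(ko.split()))  # 3 -> three alignable word tokens
```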
How It Works
Embed Python scripts into Rust binary at compile time via include_str!()
Control persistent Python server process via stdin/stdout JSON protocol
UVR Mel-Band RoFormer isolates lead vocals only (backing vocals stay in instrumental)
LRCLIB API for existing lyrics → WhisperX transcription + forced alignment if none
McLeod Pitch Detection in pure Rust at ~40Hz, sent to frontend via Tauri events
Auto-bootstrap Python 3.10 + PyTorch + ML models via uv on first launch