HoLo-ToLk — a tokenizer-free speech line on the HSL byte substrate

Two directions, two SEPARATE models today: TTS (text → speech) and STT (speech → text), both built on the zero-parameter byte encoder hsl-embedding-zero. The goal is to unify them into a single model; this demo shows both halves on the way there. The TTS tab (text → a natural-sounding voice) is the headline demo and opens first.

CC BY-NC 4.0 (non-commercial) © 2026 Jinhyun Woo. The HSL substrate is separately MIT.

Text → Speech (TTS) — tokenizer-free, natural single-speaker voice

Type an English sentence → text UTF-8 bytes → frozen HSL 27-D (no tokenizer / vocab / learned input door) → Pre-LN transformer → AR mel decoder + guided attention → HiFi-GAN → audio.
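To make the first arrow of that pipeline concrete, here is a minimal sketch of the tokenizer-free input step: a sentence becomes its raw UTF-8 byte values, with no vocabulary lookup. The function name `text_to_byte_ids` is hypothetical and chosen for illustration; the frozen HSL 27-D mapping that follows is not reproduced here, since its details are not given on this page.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (each in 0..255) of `text`.

    Hypothetical helper: it only illustrates the "text -> UTF-8 bytes" step.
    The HSL 27-D encoding applied to these bytes is NOT modelled here.
    """
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    sentence = "Hello, world."
    ids = text_to_byte_ids(sentence)
    # One integer per byte; ASCII characters map to exactly one byte each.
    print(len(ids), ids[:5])
    # Non-ASCII characters expand to several bytes, so the sequence length
    # can exceed the character count.
    print(len("café"), len(text_to_byte_ids("café")))
```

Because every possible input is already a sequence of values in 0..255, there is no out-of-vocabulary case to handle, which is the practical sense in which the front end needs no tokenizer.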

The genuine strength here is the natural voice. ⚠️ But it is one speaker (LJSpeech) and English only: a feasibility demo, not a production or multi-speaker TTS system. CPU synthesis takes a few seconds per sentence. Held-out teacher-forced mel-L1 is 0.296 for seed 0, and the result holds across seeds (seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290).
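As a rough illustration of the metric quoted above, the sketch below assumes that mel-L1 means the mean absolute difference between predicted and reference mel frames, averaged over all frames and mel bins. The exact normalization used for the reported numbers is not stated on this page, so treat `mel_l1` as a hypothetical helper. The seed scores are copied from the text; the toy spectrograms are made up purely to exercise the function.

```python
from statistics import mean


def mel_l1(pred, target):
    """Mean absolute error between two mel spectrograms.

    Both arguments are lists of frames, each frame a list of mel-bin values.
    Assumption: plain averaging over every frame and bin; the demo's actual
    evaluation code may normalize differently.
    """
    if len(pred) != len(target):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("mel-bin counts differ")
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


# Toy 2-frame, 2-bin example (invented values, not model output).
toy_pred = [[0.0, 1.0], [2.0, 3.0]]
toy_target = [[0.5, 1.0], [2.0, 2.0]]
toy_score = mel_l1(toy_pred, toy_target)

# Held-out scores for seeds 0-3, as reported in the text above.
seed_scores = [0.296, 0.293, 0.292, 0.290]
seed_mean = mean(seed_scores)
seed_spread = max(seed_scores) - min(seed_scores)

if __name__ == "__main__":
    print(toy_score, seed_mean, seed_spread)
```

On the four reported values, the mean is roughly 0.293 and the gap between the best and worst seed is about 0.006, which is the sense in which the seed-0 figure is confirmed by the other seeds.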


Model: tts_lens guided-attention AR TTS, seed 0 (held-out mel-L1 0.296, LJSpeech, English). Tokenizer-free feasibility demo: natural voice, but single-speaker.