HoLo-ToLk — a tokenizer-free speech line on the HSL byte substrate
Two directions, two SEPARATE models today: TTS (text → speech) and STT (speech → text), both built on the zero-parameter byte encoder hsl-embedding-zero. The goal is to unify them into a single model; this demo shows both halves on the way there. The TTS tab (text → a natural-sounding voice) is the headline demo and opens first.
- TTS — code: Woojiggun/HoLo-ToLk-TTS · model: ggunio/HoLo-ToLk-TTS
- STT — code: Woojiggun/HoLo-ToLk-STT · model: ggunio/HoLo-ToLk-STT
CC BY-NC 4.0 (non-commercial) © 2026 Jinhyun Woo. The HSL substrate is separately MIT.
Text → Speech (TTS) — tokenizer-free, natural single-speaker voice
Type an English sentence → text UTF-8 bytes → frozen HSL 27-D (no tokenizer / vocab / learned input door) → Pre-LN transformer → AR mel decoder + guided attention → HiFi-GAN → audio.
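The first stage of this pipeline, turning text into UTF-8 bytes with no tokenizer or vocabulary, can be sketched in plain Python. The helper name `text_to_byte_ids` is an illustrative assumption, and the frozen 27-D HSL encoding that consumes these bytes is not reproduced here.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (0-255) of `text`; no vocabulary lookup is involved."""
    return list(text.encode("utf-8"))


ascii_ids = text_to_byte_ids("Hi")   # one byte per ASCII character
accent_ids = text_to_byte_ids("é")   # non-ASCII characters span several bytes
print(ascii_ids, accent_ids)
```

Because every string maps to values in the fixed range 0-255, there is no out-of-vocabulary case to handle at the input.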
✅ The genuine strength here is the natural voice. ⚠️ But it is one speaker (LJSpeech) and English only — a feasibility demo, not a production or multi-speaker TTS system. CPU synthesis takes a few seconds per sentence. Held-out teacher-forced mel-L1 0.296, multi-seed confirmed (seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290).
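As a rough illustration of the metric quoted above, the sketch below assumes that mel-L1 is the mean absolute difference between predicted and reference mel frames; the exact normalization used for the reported numbers is not stated here. The function name `mel_l1` is hypothetical. The second half only summarizes the four per-seed values given in the text.

```python
from statistics import mean


def mel_l1(pred, target):
    """Mean absolute error over all (frame, mel-bin) entries of two equal-shaped spectrograms."""
    if len(pred) != len(target):
        raise ValueError("frame counts differ")
    diffs = []
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("mel-bin counts differ")
        diffs.extend(abs(p - t) for p, t in zip(p_frame, t_frame))
    return mean(diffs)


# Per-seed held-out values quoted in the text (seeds 0-3).
seed_scores = [0.296, 0.293, 0.292, 0.290]
seed_mean = mean(seed_scores)
seed_spread = max(seed_scores) - min(seed_scores)
print(f"mean={seed_mean:.5f} spread={seed_spread:.3f}")
```

The small spread across seeds is what backs the "multi-seed confirmed" claim: the headline figure is not an artifact of one lucky initialization.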
Model: tts_lens guided-attention AR TTS, seed 0 (held-out mel-L1 0.296, LJSpeech, English). Tokenizer-free feasibility demo: natural voice, but single-speaker.
Speech → Text (STT) — rough feasibility demo
⚠️ A feasibility demonstration, NOT a usable transcriber. English only (LibriSpeech read speech). Expect garbled output: 8 kHz (downsampled), no language model, character-level CTC, ~100h single-GPU training. A single casual word into a laptop mic is out-of-distribution and looks far worse than the headline number.
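To make "character-level CTC, no language model" concrete, here is a minimal sketch of greedy CTC decoding: take the best label per frame, collapse consecutive repeats, then drop blanks. The alphabet, the blank index, and the function name are assumptions for illustration; the demo's actual decoder and label set are not shown on this page.

```python
BLANK = 0  # assumed index of the CTC blank label
ALPHABET = ["<blank>"] + list(" abcdefghijklmnopqrstuvwxyz'")  # hypothetical label set


def ctc_greedy_decode(frame_ids):
    """Collapse repeated frame labels, then remove blanks, to get the output string."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)


# "_" stands for the blank; a blank between the two l's keeps them distinct.
frames = [BLANK if c == "_" else ALPHABET.index(c) for c in "hh_e_ll_lo"]
print(ctc_greedy_decode(frames))  # hello
```

With no language model on top, each character is chosen from the acoustics alone, so a single wrong frame label goes straight into the transcript. That is one reason the output looks garbled.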
For representative output: click an Example below (real LibriSpeech clips), or read a full English sentence aloud, clearly and slowly. What matters is the controlled comparison — HSL substrate + spectral lens beats the mel baseline in the same setup, multi-seed (CER 0.194 vs 0.213) — not the transcript itself.
| English speech - record your voice or upload (any rate; resampled to 8 kHz) | Reference transcript (optional) - examples fill this automatically; type what you said to score your own clip |
|---|---|
When a reference is present (an example, or your own typed text), the output shows CER and % characters correct. On these in-domain clips expect roughly 0.15-0.25 CER (readable-but-rough); a casual single-word mic clip will be far worse - that is expected, by design.
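The CER score described above can be sketched as a character-level edit distance divided by the reference length. This is the usual definition, offered here as an assumption; the demo's own text normalization (case, punctuation, whitespace) is not documented on this page, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


score = cer("hello world", "helo wurld")  # one deletion + one substitution
print(f"CER={score:.3f}")

# Relative CER reduction implied by the multi-seed figures quoted earlier.
relative_gain = (0.213 - 0.194) / 0.213
print(f"relative reduction={relative_gain:.1%}")
```

Note that CER can exceed 1.0 when the hypothesis contains many extra characters, so a "% characters correct" figure derived from it may need clamping at zero.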