HoLo-ToLk — a tokenizer-free speech line on the HSL byte substrate

Two directions, two SEPARATE models today: TTS (text → speech) and STT (speech → text), both built on the zero-parameter byte encoder hsl-embedding-zero. The goal is to unify them into a single model; this demo shows both halves on the way there. The TTS tab (text → a natural-sounding voice) is the headline demo and opens first.

TTS — code: Woojiggun/HoLo-ToLk-TTS · model: ggunio/HoLo-ToLk-TTS
STT — code: Woojiggun/HoLo-ToLk-STT · model: ggunio/HoLo-ToLk-STT

Text → Speech (TTS) — tokenizer-free, natural single-speaker voice

Type an English sentence → text UTF-8 bytes → frozen HSL 27-D (no tokenizer / vocab / learned input door) → Pre-LN transformer → AR mel decoder + guided attention → HiFi-GAN → audio.
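
The front of this pipeline can be sketched in a few lines. The UTF-8 step below is exactly what the standard library does; the 27-D step is only a placeholder. The real hsl-embedding-zero mapping is not described on this page, so `placeholder_byte_features` is a made-up, parameter-free function used purely to illustrate the idea of a frozen per-byte vector and the resulting (length, 27) shape.

```python
import math

HSL_DIM = 27  # width stated in the text; the mapping below is NOT the real HSL encoder


def text_to_bytes(text: str) -> list[int]:
    """Encode text as raw UTF-8 byte values (0..255); no vocabulary lookup."""
    return list(text.encode("utf-8"))


def placeholder_byte_features(byte: int) -> list[float]:
    """Hypothetical stand-in for the frozen encoder.

    Returns a fixed 27-D vector per byte value with no learned parameters.
    """
    return [math.sin((byte + 1) * (k + 1) / HSL_DIM) for k in range(HSL_DIM)]


def encode(text: str) -> list[list[float]]:
    """Map text to one 27-D vector per UTF-8 byte."""
    return [placeholder_byte_features(b) for b in text_to_bytes(text)]


frames = encode("Hi!")
print(len(frames), len(frames[0]))  # 3 27
```

Because the input is bytes rather than tokens, any string is representable without an out-of-vocabulary case; a non-ASCII character such as "é" simply contributes two byte positions.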

✅ The genuine strength here is the natural voice. ⚠️ But it is one speaker (LJSpeech) and English only — a feasibility demo, not a production or multi-speaker TTS system. CPU synthesis takes a few seconds per sentence. Held-out teacher-forced mel-L1 0.296, multi-seed confirmed (seeds 0–3 = 0.296 / 0.293 / 0.292 / 0.290).
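
For readers unfamiliar with the metric, mel-L1 is commonly computed as the mean absolute difference between predicted and reference mel-spectrogram values, and "teacher-forced" means the decoder is fed the ground-truth previous frames while it is scored. The sketch below assumes that definition (the page does not spell out its exact normalization) and summarizes the four seed scores quoted above; the toy spectrograms are invented for illustration.

```python
from statistics import mean


def mel_l1(pred, target):
    """Mean absolute error over all (frame, mel-bin) cells of two equal-shape grids."""
    if len(pred) != len(target):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        if len(p_row) != len(t_row):
            raise ValueError("mel-bin counts differ")
        for p, t in zip(p_row, t_row):
            total += abs(p - t)
            count += 1
    return total / count


# Toy 2-frame x 2-bin example (made-up numbers, not model output).
toy_pred = [[0.0, 1.0], [2.0, 3.0]]
toy_target = [[0.5, 1.0], [2.0, 2.0]]
print(mel_l1(toy_pred, toy_target))  # 0.375

# Held-out scores reported above for seeds 0 to 3.
seed_scores = {0: 0.296, 1: 0.293, 2: 0.292, 3: 0.290}
seed_mean = mean(seed_scores.values())
seed_spread = max(seed_scores.values()) - min(seed_scores.values())
print(f"mean={seed_mean:.3f} spread={seed_spread:.3f}")  # mean=0.293 spread=0.006
```

The four seeds differ by about 0.006, which is the sense in which the 0.296 figure is described as multi-seed confirmed.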

English text to speak

Stop threshold, slider range 0.1 to 0.9 (lower this toward 0.2 if the clip sounds cut off or runs to the frame cap)
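
To see why the threshold interacts with the frame cap, here is a minimal sketch of an autoregressive stop rule. It assumes the decoder emits a per-frame stop probability and halts as soon as that value exceeds the threshold; this rule, the names `decode_length` and `toy_stop_prob`, and the cap of 200 frames are all illustrative assumptions, not details taken from the model code.

```python
def decode_length(stop_prob_at, threshold, frame_cap):
    """Return (n_frames, hit_cap) for a decoder that stops when stop prob > threshold.

    stop_prob_at is a hypothetical callable: frame index -> stop probability.
    """
    for step in range(frame_cap):
        if stop_prob_at(step) > threshold:
            return step + 1, False
    # The stop signal never crossed the threshold, so decoding ran to the cap.
    return frame_cap, True


def toy_stop_prob(step):
    """Made-up stop curve that rises linearly and saturates at 0.5."""
    return min(0.5, step / 100)


print(decode_length(toy_stop_prob, threshold=0.9, frame_cap=200))  # (200, True)
print(decode_length(toy_stop_prob, threshold=0.2, frame_cap=200))  # (22, False)
```

Under this assumed rule, a weak stop signal never reaches a high threshold, so synthesis runs until the cap; lowering the threshold lets the same signal end the clip on its own.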

Example sentences - click one, then press Synthesize

Synthesized voice (single-speaker LJSpeech, English)

Status

Model: tts_lens guided-attention AR TTS, seed 0 (held-out mel-L1 0.296, LJSpeech, English). Tokenizer-free feasibility demo - natural voice, but single-speaker. CC BY-NC 4.0 (non-commercial) © 2026 Jinhyun Woo.

Speech → Text (STT) — rough feasibility demo